ocr-daily-paper's Introduction

imjunwei

ocr-daily-paper's Issues

New submissions for Fri, 20 Aug 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

End-to-End License Plate Recognition Pipeline for Real-time Low Resource Video Based Applications

  • Authors: Alif Ashrafee, Akib Mohammed Khan, Mohammad Sabik Irbaz, MD Abdullah Al Nasim
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2108.08339
  • Pdf link: https://arxiv.org/pdf/2108.08339
  • Abstract
    Automatic License Plate Recognition systems aim to provide an end-to-end solution towards detecting, localizing, and recognizing license plate characters from vehicles appearing in video frames. However, deploying such systems in the real world requires real-time performance in low-resource environments. In our paper, we propose a novel two-stage detection pipeline paired with Vision API that aims to provide real-time inference speed along with consistently accurate detection and recognition performance. We used a Haar cascade classifier as a filter on top of our backbone MobileNet SSDv2 detection model. This reduces inference time by only focusing on high-confidence detections and using them for recognition. We also impose a temporal frame separation strategy to identify multiple vehicle license plates in the same clip. Furthermore, since there are no publicly available Bangla license plate datasets, we created an image dataset and a video dataset containing license plates in the wild. We trained our models on the image dataset and achieved an AP(0.5) score of 86%, and we tested our pipeline on the video dataset and observed reasonable detection and recognition performance (82.7% detection rate and 60.8% OCR F1 score) with real-time processing speed (27.2 frames per second).
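
For readers who want a feel for the gating idea, the sketch below illustrates a frame-level filter of this kind in OpenCV. It is only a rough illustration: the cascade file name, the stub for the heavy detection/recognition stage, and the thresholds are assumptions, not the authors' released code.

```python
import cv2

# Hypothetical cascade file standing in for the paper's trained filter; the
# MobileNet SSDv2 detector and the OCR stage are represented by a stub below.
plate_cascade = cv2.CascadeClassifier("plate_cascade.xml")

def heavy_detect_and_recognize(frame):
    """Stub for the expensive stage (MobileNet SSDv2 detection + OCR)."""
    return []

def process_frame(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Stage 1: cheap Haar-cascade filter. Frames with no plate-like region are
    # dropped, which is what keeps the overall pipeline real-time.
    candidates = plate_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(candidates) == 0:
        return []
    # Stage 2: only frames that pass the cheap filter reach the heavy detector + OCR.
    return heavy_detect_and_recognize(frame)
```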

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Wed, 20 Oct 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

BlockIoT: Blockchain-based Health Data Integration using IoT Devices

  • Authors: Manan Shukla, Jianjing Lin, Oshani Seneviratne
  • Subjects: Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/2110.10123
  • Pdf link: https://arxiv.org/pdf/2110.10123
  • Abstract
    The development and adoption of Electronic Health Records (EHR) and health monitoring Internet of Things (IoT) devices have enabled the digitization of patient records and have also substantially transformed the healthcare delivery system in aspects such as remote patient monitoring, healthcare decision making, and medical research. However, data tends to be fragmented among health infrastructures, which prevents interoperability of medical data at the point of care. In order to address this gap, we introduce BlockIoT, which uses blockchain technology to transfer previously inaccessible and centralized data from medical devices to EHR systems, which provides greater insight to providers who can, in turn, provide better outcomes for patients. This notion of interoperability of medical device data is possible through an Application Programming Interface (API), which serves as a versatile endpoint for all incoming medical device data, a distributed file system that ensures data resilience, and knowledge templates that analyze, identify, and represent medical device data to providers. Our participatory design survey on BlockIoT demonstrates that BlockIoT is a suitable system to supplement physicians' clinical practice and increases efficiency in most healthcare specialties, including cardiology, pulmonology, endocrinology, and primary care.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Fri, 3 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

Scene Text recognition with Full Normalization

  • Authors: Nathan Zachary, Gerald Carl, Russell Elijah, Hessi Roma, Robert Leer, James Amelia
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.01034
  • Pdf link: https://arxiv.org/pdf/2109.01034
  • Abstract
    Scene text recognition has made significant progress in recent years and has become an important part of many practical workflows. The widespread use of mobile devices opens up wide possibilities for using OCR technologies in everyday life. However, the lack of training data for new research in this area remains a relevant problem. In this article, we present a new dataset consisting of real shots taken on smartphones and demonstrate the effectiveness of profile normalization in this task. In addition, the influence of various augmentations during the training of models for analyzing document images on smartphones is studied in detail. Our dataset is publicly available.

Keyword: OCR

Scene Text recognition with Full Normalization

  • Authors: Nathan Zachary, Gerald Carl, Russell Elijah, Hessi Roma, Robert Leer, James Amelia
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.01034
  • Pdf link: https://arxiv.org/pdf/2109.01034
  • Abstract
    Scene text recognition has made significant progress in recent years and has become an important part of many practical workflows. The widespread use of mobile devices opens up wide possibilities for using OCR technologies in everyday life. However, the lack of training data for new research in this area remains a relevant problem. In this article, we present a new dataset consisting of real shots taken on smartphones and demonstrate the effectiveness of profile normalization in this task. In addition, the influence of various augmentations during the training of models for analyzing document images on smartphones is studied in detail. Our dataset is publicly available.

Keyword: Handwriting

There is no result

Keyword: Scene Text

Scene Text recognition with Full Normalization

  • Authors: Nathan Zachary, Gerald Carl, Russell Elijah, Hessi Roma, Robert Leer, James Amelia
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.01034
  • Pdf link: https://arxiv.org/pdf/2109.01034
  • Abstract
    Scene text recognition has made significant progress in recent years and has become an important part of many practical workflows. The widespread use of mobile devices opens up wide possibilities for using OCR technologies in everyday life. However, the lack of training data for new research in this area remains a relevant problem. In this article, we present a new dataset consisting of real shots taken on smartphones and demonstrate the effectiveness of profile normalization in this task. In addition, the influence of various augmentations during the training of models for analyzing document images on smartphones is studied in detail. Our dataset is publicly available.

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Wed, 25 Aug 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

Meta Self-Learning for Multi-Source Domain Adaptation: A Benchmark

  • Authors: Shuhao Qiu, Chuang Zhu, Wenli Zhou
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2108.10840
  • Pdf link: https://arxiv.org/pdf/2108.10840
  • Abstract
    In recent years, deep learning-based methods have shown promising results in the computer vision area. However, a common deep learning model requires a large amount of labeled data, which is labor-intensive to collect and label. What's more, the model can be ruined by the domain shift between training data and testing data. Text recognition is a broadly studied field in computer vision and suffers from the same problems noted above due to the diversity of fonts and complicated backgrounds. In this paper, we focus on the text recognition problem and mainly make three contributions toward these problems. First, we collect a multi-source domain adaptation dataset for text recognition, including five different domains with over five million images, which is, to the best of our knowledge, the first multi-domain text recognition dataset. Secondly, we propose a new method called Meta Self-Learning, which combines the self-learning method with the meta-learning paradigm and achieves better recognition results in the multi-domain adaptation setting. Thirdly, extensive experiments are conducted on the dataset to provide a benchmark and also show the effectiveness of our method. The code of our work and the dataset will be available soon at https://bupt-ai-cz.github.io/Meta-SelfLearning/.

Keyword: OCR

L1-regularized neural ranking for risk stratification and its application to prediction of time to distant metastasis in luminal node negative chemotherapy naïve breast cancer patients

  • Authors: Fayyaz Minhas, Michael S. Toss, Noor ul Wahab, Emad Rakha, Nasir M. Rajpoot
  • Subjects: Machine Learning (cs.LG); Applications (stat.AP)
  • Arxiv link: https://arxiv.org/abs/2108.10365
  • Pdf link: https://arxiv.org/pdf/2108.10365
  • Abstract
    Can we predict if an early stage cancer patient is at high risk of developing distant metastasis and what clinicopathological factors are associated with such a risk? In this paper, we propose a ranking-based, censoring-aware machine learning model for answering such questions. The proposed model is able to generate an interpretable formula for risk stratification using a minimal number of clinicopathological covariates through L1-regularization. Using this approach, we analyze the association of time to distant metastasis (TTDM) with various clinical parameters for early stage, luminal (ER+ or HER2-) breast cancer patients who received endocrine therapy but no chemotherapy (n = 728). The TTDM risk stratification formula obtained using the proposed approach is primarily based on mitotic score, histological tumor type and lymphovascular invasion. These findings corroborate the known role of these covariates in increased risk for distant metastasis. Our analysis shows that the proposed risk stratification formula can discriminate between cases with high and low risk of distant metastasis (p-value < 0.005) and can also rank cases based on their time to distant metastasis with a concordance index of 0.73.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Fri, 27 Aug 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

StackMix and Blot Augmentations for Handwritten Text Recognition

  • Authors: Alex Shonenkov, Denis Karachev, Maxim Novopoltsev, Mark Potanin, Denis Dimitrov
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2108.11667
  • Pdf link: https://arxiv.org/pdf/2108.11667
  • Abstract
    This paper proposes a handwritten text recognition (HTR) system that outperforms current state-of-the-art methods. The comparison was carried out on three of the most frequently used datasets in the HTR task, namely Bentham, IAM, and Saint Gall. In addition, the results on two recently presented datasets, Peter the Great's manuscripts and the HKR Dataset, are provided. The paper describes the architecture of the neural network and two ways of increasing the volume of training data: augmentation that simulates strikethrough text (HandWritten Blots) and a new text generation method (StackMix), which proved to be very effective in HTR tasks. StackMix can also be applied to the standalone task of generating handwritten text based on printed text.
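
A rough idea of what a strikethrough-style augmentation can look like is sketched below. This is an assumption-laden simplification (straight random strokes drawn with OpenCV), not the authors' HandWritten Blots implementation.

```python
import numpy as np
import cv2

def random_blot(image, num_strokes=2, rng=None):
    """Draw random dark strokes over a grayscale text-line image (HxW uint8).

    A rough stand-in for a strikethrough augmentation: the real HandWritten
    Blots simulate pen strokes; here we only draw straight lines.
    """
    rng = rng or np.random.default_rng()
    out = image.copy()
    h, w = out.shape[:2]
    for _ in range(num_strokes):
        x1, x2 = rng.integers(0, w, size=2)
        y1, y2 = rng.integers(int(0.25 * h), int(0.75 * h), size=2)
        thickness = int(rng.integers(1, max(2, h // 15)))
        cv2.line(out, (int(x1), int(y1)), (int(x2), int(y2)), color=0,
                 thickness=thickness)
    return out
```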

Keyword: OCR

LayoutReader: Pre-training of Text and Layout for Reading Order Detection

  • Authors: Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, Furu Wei
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.11591
  • Pdf link: https://arxiv.org/pdf/2108.11591
  • Abstract
    Reading order detection is the cornerstone to understanding visually-rich documents (e.g., receipts and forms). Unfortunately, no existing work has taken advantage of advanced deep learning models because it is too laborious to annotate a large enough dataset. We observe that the reading order of WORD documents is embedded in their XML metadata; meanwhile, it is easy to convert WORD documents to PDFs or images. Therefore, in an automated manner, we construct ReadingBank, a benchmark dataset that contains reading order, text, and layout information for 500,000 document images covering a wide spectrum of document types. This first-ever large-scale dataset unleashes the power of deep neural networks for reading order detection. Specifically, our proposed LayoutReader captures the text and layout information for reading order prediction using the seq2seq model. In our experiments, it performs almost perfectly in reading order detection and significantly improves both open-source and commercial OCR engines at ordering text lines in their results. We will release the dataset and model at \url{https://aka.ms/readingbank}.
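
For context, the geometric heuristic that learned reading-order models aim to replace is a simple top-to-bottom, left-to-right sort of word boxes. A minimal sketch of that baseline, with hypothetical (word, box) inputs, is shown below; LayoutReader itself is a seq2seq model and is not reproduced here.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in image coordinates

def heuristic_reading_order(words: List[Tuple[str, Box]],
                            line_tolerance: int = 10) -> List[str]:
    """Naive top-to-bottom, left-to-right ordering of word boxes.

    This is only the geometric baseline that learned reading-order models aim
    to improve on; it breaks down for multi-column or rotated layouts.
    """
    # Bucket words into rough lines by their top coordinate, then sort
    # left-to-right within each bucket.
    ordered = sorted(words, key=lambda wb: (wb[1][1] // line_tolerance, wb[1][0]))
    return [w for w, _ in ordered]

# Example with hypothetical boxes:
words = [("world", (60, 12, 110, 30)), ("hello", (5, 10, 50, 30))]
print(heuristic_reading_order(words))  # ['hello', 'world']
```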

Mining Contextual Information Beyond Image for Semantic Segmentation

  • Authors: Zhenchao Jin, Tao Gong, Dongdong Yu, Qi Chu, Jian Wang, Changhu Wang, Jie Shao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2108.11819
  • Pdf link: https://arxiv.org/pdf/2108.11819
  • Abstract
    This paper studies the context aggregation problem in semantic image segmentation. Existing research focuses on improving the pixel representations by aggregating the contextual information within individual images. Though impressive, these methods neglect the significance of the representations of pixels of the corresponding class beyond the input image. To address this, this paper proposes to mine the contextual information beyond individual images to further augment the pixel representations. We first set up a feature memory module, which is updated dynamically during training, to store the dataset-level representations of various categories. Then, we learn the class probability distribution of each pixel representation under the supervision of the ground-truth segmentation. Finally, the representation of each pixel is augmented by aggregating the dataset-level representations based on the corresponding class probability distribution. Furthermore, by utilizing the stored dataset-level representations, we also propose a representation consistent learning strategy to make the classification head better address intra-class compactness and inter-class dispersion. The proposed method can be effortlessly incorporated into existing segmentation frameworks (e.g., FCN, PSPNet, OCRNet and DeepLabV3) and brings consistent performance improvements. Mining contextual information beyond the image allows us to report state-of-the-art performance on various benchmarks: ADE20K, LIP, Cityscapes and COCO-Stuff.
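
One plausible reading of the feature memory module is a per-class prototype bank updated by a moving average during training and then used to aggregate dataset-level context for each pixel. The PyTorch sketch below follows that reading; the momentum value and the concatenation-based aggregation are assumptions, not the paper's implementation.

```python
import torch

class ClassMemory:
    """Dataset-level per-class feature prototypes, updated by EMA during training."""

    def __init__(self, num_classes: int, dim: int, momentum: float = 0.999):
        self.bank = torch.zeros(num_classes, dim)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        # feats: (N, dim) pixel features; labels: (N,) ground-truth classes.
        for c in labels.unique():
            mean_c = feats[labels == c].mean(dim=0)
            self.bank[c] = self.momentum * self.bank[c] + (1 - self.momentum) * mean_c

    def augment(self, feats: torch.Tensor, class_probs: torch.Tensor) -> torch.Tensor:
        # Aggregate dataset-level prototypes weighted by each pixel's predicted
        # class distribution and concatenate with the original representation.
        context = class_probs @ self.bank  # (N, dim)
        return torch.cat([feats, context], dim=1)
```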

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Mon, 20 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

There is no result

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Wed, 1 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

Thermostat: A Large Collection of NLP Model Explanations and Analysis Tools

  • Authors: Nils Feldhus, Robert Schwarzenberg, Sebastian Möller
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.13961
  • Pdf link: https://arxiv.org/pdf/2108.13961
  • Abstract
    In the language domain, as in other domains, neural explainability takes an ever more important role, with feature attribution methods at the forefront. Many such methods require considerable computational resources and expert knowledge about implementation details and parameter choices. To facilitate research, we present Thermostat, which consists of a large collection of model explanations and accompanying analysis tools. Thermostat allows easy access to over 200k explanations for the decisions of prominent state-of-the-art models spanning different NLP tasks, generated with multiple explainers. The dataset took over 10k GPU hours (> one year) to compile; compute time that the community now saves. The accompanying software tools allow explanations to be analysed both instance-wise and cumulatively at the corpus level. Users can investigate and compare models, datasets and explainers without the need to orchestrate implementation details. Thermostat is fully open source, democratizes explainability research in the language domain, circumvents redundant computations and increases comparability and replicability.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Thu, 23 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

There is no result

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Wed, 29 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

Fake or Credible? Towards Designing Services to Support Users' Credibility Assessment of News Content

  • Authors: Enrico Bunde, Niklas Kühl, Christian Meske
  • Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
  • Arxiv link: https://arxiv.org/abs/2109.13336
  • Pdf link: https://arxiv.org/pdf/2109.13336
  • Abstract
    Fake news has become omnipresent in digitalized areas such as social media platforms. While being disseminated online, it also poses a threat to individuals and societies offline, for example, in the context of democratic elections. Research and practice have investigated the detection of fake news with behavioral science or method-related perspectives. However, to date, we lack design knowledge on presenting fake news warnings to users to support their individual news credibility assessment. We present the journey through the first design cycle on developing a fake news detection service focusing on the user interface design. The design is grounded in concepts from the field of source credibility theory and instantiated in a prototype that was qualitatively evaluated. The 13 participants communicated their interest in a lightweight application that aids in the news credibility assessment and rated the design features as useful as well as desirable.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Thu, 16 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

Searching for Representation: A sociotechnical audit of googling for members of U.S. Congress

  • Authors: Emma Lurie, Deirdre K. Mulligan
  • Subjects: Computers and Society (cs.CY)
  • Arxiv link: https://arxiv.org/abs/2109.07012
  • Pdf link: https://arxiv.org/pdf/2109.07012
  • Abstract
    High-quality online civic infrastructure is increasingly critical for the success of democratic processes. There is a pervasive reliance on search engines to find facts and information necessary for political participation and oversight. We find that approximately 10% of the top Google search results are likely to mislead California information seekers who use search to identify their congressional representatives. 70% of the misleading results appear in featured snippets above the organic search results. We use both qualitative and quantitative methods to understand what aspects of the information ecosystem lead to this sociotechnical breakdown. Factors identified include Google's heavy reliance on Wikipedia, the lack of authoritative, machine parsable, high accuracy data about the identity of elected officials based on geographic location, and the search engine's treatment of under-specified queries. We recommend steps that Google can take to meet its stated commitment to providing high quality civic information, and steps that information providers can take to improve the legibility and quality of information about congressional representatives available to search algorithms.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Thu, 19 Aug 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

AdapterHub Playground: Simple and Flexible Few-Shot Learning with Adapters

  • Authors: Tilman Beck, Bela Bohlender, Christina Viehmann, Vincent Hane, Yanik Adamson, Jaber Khuri, Jonas Brossmann, Jonas Pfeiffer, Iryna Gurevych
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.08103
  • Pdf link: https://arxiv.org/pdf/2108.08103
  • Abstract
    The open-access dissemination of pretrained language models through online repositories has led to a democratization of state-of-the-art natural language processing (NLP) research. This also allows people outside of NLP to use such models and adapt them to specific use-cases. However, a certain amount of technical proficiency is still required which is an entry barrier for users who want to apply these models to a certain task but lack the necessary knowledge or resources. In this work, we aim to overcome this gap by providing a tool which allows researchers to leverage pretrained models without writing a single line of code. Built upon the parameter-efficient adapter modules for transfer learning, our AdapterHub Playground provides an intuitive interface, allowing the usage of adapters for prediction, training and analysis of textual data for a variety of NLP tasks. We present the tool's architecture and demonstrate its advantages with prototypical use-cases, where we show that predictive performance can easily be increased in a few-shot learning scenario. Finally, we evaluate its usability in a user study. We provide the code and a live interface at https://adapter-hub.github.io/playground.

End-to-End Urban Driving by Imitating a Reinforcement Learning Coach

  • Authors: Zhejun Zhang, Alexander Liniger, Dengxin Dai, Fisher Yu, Luc Van Gool
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2108.08265
  • Pdf link: https://arxiv.org/pdf/2108.08265
  • Abstract
    End-to-end approaches to autonomous driving commonly rely on expert demonstrations. Although humans are good drivers, they are not good coaches for end-to-end algorithms that demand dense on-policy supervision. On the contrary, automated experts that leverage privileged information can efficiently generate large scale on-policy and off-policy demonstrations. However, existing automated experts for urban driving make heavy use of hand-crafted rules and perform suboptimally even on driving simulators, where ground-truth information is available. To address these issues, we train a reinforcement learning expert that maps bird's-eye view images to continuous low-level actions. While setting a new performance upper-bound on CARLA, our expert is also a better coach that provides informative supervision signals for imitation learning agents to learn from. Supervised by our reinforcement learning coach, a baseline end-to-end agent with monocular camera-input achieves expert-level performance. Our end-to-end agent achieves a 78% success rate while generalizing to a new town and new weather on the NoCrash-dense benchmark and state-of-the-art performance on the more challenging CARLA LeaderBoard.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Fri, 17 Sep 21

Keyword: Text Detection

Urdu text in natural scene images: a new dataset and preliminary text detection

  • Authors: Hazrat Ali, Khalid Iqbal, Ghulam Mujtaba, Ahmad Fayyaz, Mohammad Farhad Bulbul, Fazal Wahab Karam, Ali Zahir
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2109.08060
  • Pdf link: https://arxiv.org/pdf/2109.08060
  • Abstract
    Text detection in natural scene images for content analysis is an interesting task. The research community has seen some great developments for English/Mandarin text detection. However, Urdu text extraction in natural scene images is a task not well addressed. In this work, firstly, a new dataset is introduced for Urdu text in natural scene images. The dataset comprises 500 standalone images acquired from real scenes. Secondly, the channel-enhanced Maximally Stable Extremal Region (MSER) method is applied to extract Urdu text regions as candidates in an image. A two-stage filtering mechanism is applied to eliminate non-candidate regions. In the first stage, text and noise are classified based on their geometric properties. In the second stage, a support vector machine classifier is trained to discard non-text candidate regions. After this, text candidate regions are linked using centroid-based vertical and horizontal distances. Text lines are further analyzed by a different classifier based on HOG features to remove non-text regions. Extensive experimentation is performed on the locally developed dataset to evaluate the performance. The experimental results show good performance on test set images. The dataset will be made available for research use. To the best of our knowledge, the work is the first of its kind for the Urdu language and would provide a good dataset for free research use and serve as a baseline performance on the task of Urdu text extraction.
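
The MSER candidate-extraction step is available directly in OpenCV. The sketch below covers only that first step (plain grayscale MSER rather than the paper's channel-enhanced variant, and without the geometric, SVM and HOG-based filtering stages):

```python
import cv2

def text_region_candidates(image_bgr):
    """Extract MSER bounding boxes as raw text-region candidates.

    Only the first step of the pipeline described above: channel enhancement,
    geometric filtering, SVM classification and HOG-based line verification
    are omitted here.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)
    # Each bbox is (x, y, w, h); downstream stages would filter these.
    return bboxes
```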

Keyword: Text Recognition

There is no result

Keyword: OCR

An influencer-based approach to understanding radical right viral tweets

  • Authors: Laila Sprejer, Helen Margetts, Kleber Oliveira, David O'Sullivan, Bertie Vidgen
  • Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2109.07588
  • Pdf link: https://arxiv.org/pdf/2109.07588
  • Abstract
    Radical right influencers routinely use social media to spread highly divisive, disruptive and anti-democratic messages. Assessing and countering the challenge that such content poses is crucial for ensuring that online spaces remain open, safe and accessible. Previous work has paid little attention to understanding factors associated with radical right content that goes viral. We investigate this issue with a new dataset ROT which provides insight into the content, engagement and followership of a set of 35 radical right influencers. It includes over 50,000 original entries and over 40 million retweets, quotes, replies and mentions. We use a multilevel model to measure engagement with tweets, which are nested in each influencer. We show that it is crucial to account for the influencer-level structure, and find evidence of the importance of both influencer- and content-level factors, including the number of followers each influencer has, the type of content (original posts, quotes and replies), the length and toxicity of content, and whether influencers request retweets. We make ROT available for other researchers to use.

Auditing Fairness and Imputation Impact in Predictive Analytics for Higher Education

  • Authors: Hadis Anahideh, Nazanin Nezami, Denisa Gándara
  • Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2109.07908
  • Pdf link: https://arxiv.org/pdf/2109.07908
  • Abstract
    Nowadays, colleges and universities use predictive analytics in a variety of ways to increase student success rates. Despite the potential of predictive analytics, there exist two major barriers to their adoption in higher education: (a) the lack of democratization in deployment, and (b) the potential to exacerbate inequalities. Education researchers and policymakers encounter numerous challenges in deploying predictive modeling in practice. These challenges present themselves at different steps of modeling, including data preparation, model development, and evaluation. Nevertheless, each of these steps can introduce additional bias to the system if not appropriately performed. Most large-scale and nationally representative education data sets suffer from a significant number of incomplete responses from the research participants. Missing values are a frequent latent cause behind many data analysis challenges. While many education-related studies addressed the challenges of missing data, little is known about the impact of handling missing values on the fairness of predictive outcomes in practice. In this paper, we set out to first assess the disparities in predictive modeling outcomes for college-student success, then investigate the impact of imputation techniques on the model performance and fairness using a comprehensive set of common metrics. The comprehensive analysis of a real large-scale education dataset reveals key insights on the modeling disparity and how different imputation techniques fundamentally compare to one another in terms of their impact on the fairness of the student-success predictive outcome.

Surveying the Research on Fake News in Social Media: a Tale of Networks and Language

  • Authors: Giancarlo Ruffo (1), Alfonso Semeraro (1), Anastasia Giachanou (2), Paolo Rosso (3) ((1) Università degli Studi di Torino, (2) Utrecht University, (3) Universitat Politècnica de València)
  • Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
  • Arxiv link: https://arxiv.org/abs/2109.07909
  • Pdf link: https://arxiv.org/pdf/2109.07909
  • Abstract
    The history of journalism and news diffusion is tightly coupled with the effort to dispel hoaxes, misinformation, propaganda, unverified rumours, poor reporting, and messages containing hate and divisions. With the explosive growth of online social media and billions of individuals engaged with consuming, creating, and sharing news, this ancient problem has surfaced with a renewed intensity threatening our democracies, public health, and news outlets' credibility. This has triggered many researchers to develop new methods for studying, understanding, detecting, and preventing fake-news diffusion; as a consequence, thousands of scientific papers have been published in a relatively short period, making researchers from different disciplines struggle to identify open problems and the most relevant trends. The aim of this survey is threefold: first, we want to provide the researchers interested in this multidisciplinary and challenging area with a network-based analysis of the existing literature to assist them with a visual exploration of papers that can be of interest; second, we present a selection of the main results achieved so far, adopting the network as a unifying framework to represent and make sense of data, to model diffusion processes, and to evaluate different debunking strategies. Finally, we present an outline of the most relevant research trends focusing on the moving target of fake-news, bots, and trolls identification by means of data mining and text technologies; although scholars working on computational linguistics and on networks traditionally belong to different scientific communities, we expect that forthcoming computational approaches to prevent fake news from polluting the social media must be developed using hybrid and up-to-date methodologies.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Thu, 26 Aug 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

There is no result

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Wed, 13 Oct 21

Keyword: Text Detection

On Exploring and Improving Robustness of Scene Text Detection Models

  • Authors: Shilian Wu, Wei Zhai, Yongrui Li, Kewei Wang, Zengfu Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.05700
  • Pdf link: https://arxiv.org/pdf/2110.05700
  • Abstract
    It is crucial to understand the robustness of text detection models with regard to extensive corruptions, since scene text detection techniques have many practical applications. For systematically exploring this problem, we propose two datasets from which to evaluate scene text detection models: ICDAR2015-C (IC15-C) and CTW1500-C (CTW-C). Our study extends the investigation of the performance and robustness of the proposed region proposal, regression and segmentation-based scene text detection frameworks. Furthermore, we perform a robustness analysis of six key components: pre-training data, backbone, feature fusion module, multi-scale predictions, representation of text instances and loss function. Finally, we present a simple yet effective data-based method to destroy the smoothness of text regions by merging background and foreground, which can significantly increase the robustness of different text detection networks. We hope that this study will provide valid data points as well as experience for future research. Benchmark, code and data will be made available at \url{https://github.com/wushilian/robust-scene-text-detection-benchmark}.
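
Robustness evaluation of this kind generally amounts to sweeping corruption types and severities over a clean test set and re-scoring the detector. The sketch below shows the shape of such a loop; the `detector`, `evaluate` and `dataset` objects are hypothetical, and the corruption parameters are illustrative rather than those used for IC15-C/CTW-C.

```python
import numpy as np

def gaussian_noise(image, severity=1):
    """One simple corruption; corruption benchmarks typically use ~15 types x 5 severities."""
    # Illustrative severity scale, not the benchmark's exact parameters.
    sigma = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1] * 255.0
    noisy = image.astype(np.float32) + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def robustness_sweep(detector, evaluate, dataset, corruptions, severities=(1, 2, 3, 4, 5)):
    """Re-score `detector` on corrupted copies of `dataset` (all hypothetical objects)."""
    scores = {}
    for corrupt in corruptions:
        for sev in severities:
            preds = [detector(corrupt(img, sev)) for img, _ in dataset]
            gts = [gt for _, gt in dataset]
            scores[(corrupt.__name__, sev)] = evaluate(preds, gts)
    return scores
```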

Keyword: Text Recognition

Rescoring Sequence-to-Sequence Models for Text Line Recognition with CTC-Prefixes

  • Authors: Christoph Wick, Jochen Zöllner, Tobias Grüning
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.05909
  • Pdf link: https://arxiv.org/pdf/2110.05909
  • Abstract
    In contrast to Connectionist Temporal Classification (CTC) approaches, Sequence-To-Sequence (S2S) models for Handwritten Text Recognition (HTR) suffer from errors such as skipped or repeated words which often occur at the end of a sequence. In this paper, to combine the best of both approaches, we propose to use the CTC-Prefix-Score during S2S decoding. Hereby, during beam search, paths that are invalid according to the CTC confidence matrix are penalised. Our network architecture is composed of a Convolutional Neural Network (CNN) as visual backbone, bidirectional Long-Short-Term-Memory-Cells (LSTMs) as encoder, and a decoder which is a Transformer with inserted mutual attention layers. The CTC confidences are computed on the encoder while the Transformer is only used for character-wise S2S decoding. We evaluate this setup on three HTR data sets: IAM, Rimes, and StAZH. On IAM, we achieve a competitive Character Error Rate (CER) of 2.95% when pretraining our model on synthetic data and including a character-based language model for contemporary English. Compared to other state-of-the-art approaches, our model requires about 10-20 times less parameters. Access our shared implementations via this link to GitHub: https://github.com/Planet-AI-GmbH/tfaip-hybrid-ctc-s2s.
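
The rescoring idea can be summarised as an interpolation between the S2S decoder score and a CTC prefix score for each beam hypothesis. The toy sketch below assumes both log-probabilities are supplied by the respective decoders; the actual CTC prefix scoring, architecture and weighting scheme are described in the paper and its repository.

```python
def rescored_hypothesis(s2s_logprob: float,
                        ctc_prefix_logprob: float,
                        alpha: float = 0.5) -> float:
    """Combine decoder and CTC-prefix scores for one beam hypothesis.

    `alpha` trades off the Transformer decoder score against the CTC prefix
    score; prefixes that the CTC confidence matrix considers invalid receive a
    very low (or -inf) ctc_prefix_logprob and are effectively pruned.
    """
    return (1.0 - alpha) * s2s_logprob + alpha * ctc_prefix_logprob

# During beam search, each extended prefix would be ranked by this combined
# score instead of the raw S2S log-probability alone.
```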

Keyword: OCR

Perspective-taking to Reduce Affective Polarization on Social Media

  • Authors: Martin Saveski, Nabeel Gillani, Ann Yuan, Prashanth Vijayaraghavan, Deb Roy
  • Subjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY)
  • Arxiv link: https://arxiv.org/abs/2110.05596
  • Pdf link: https://arxiv.org/pdf/2110.05596
  • Abstract
    The intensification of affective polarization worldwide has raised new questions about how social media platforms might be further fracturing an already-divided public sphere. As opposed to ideological polarization, affective polarization is defined less by divergent policy preferences and more by strong negative emotions towards opposing political groups, and thus arguably poses a formidable threat to rational democratic discourse. We explore if prompting perspective-taking on social media platforms can help enhance empathy between opposing groups as a first step towards reducing affective polarization. Specifically, we deploy a randomized field experiment through a browser extension to 1,611 participants on Twitter, which enables participants to randomly replace their feeds with those belonging to accounts whose political views either agree with or diverge from their own. We find that simply exposing participants to "outgroup" feeds enhances engagement, but not an understanding of why others hold their political views. On the other hand, framing the experience in familiar, empathic terms by prompting participants to recall a disagreement with a friend does not affect engagement, but does increase their ability to understand opposing views. Our findings illustrate how social media platforms might take simple steps that align with business objectives to reduce affective polarization.

On the Security Risks of AutoML

  • Authors: Ren Pang, Zhaohan Xi, Shouling Ji, Xiapu Luo, Ting Wang
  • Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.06018
  • Pdf link: https://arxiv.org/pdf/2110.06018
  • Abstract
    Neural Architecture Search (NAS) represents an emerging machine learning (ML) paradigm that automatically searches for models tailored to given tasks, which greatly simplifies the development of ML systems and propels the trend of ML democratization. Yet, little is known about the potential security risks incurred by NAS, which is concerning given the increasing use of NAS-generated models in critical domains. This work represents a solid initial step towards bridging the gap. Through an extensive empirical study of 10 popular NAS methods, we show that compared with their manually designed counterparts, NAS-generated models tend to suffer greater vulnerability to various malicious attacks (e.g., adversarial evasion, model poisoning, and functionality stealing). Further, with both empirical and analytical evidence, we provide possible explanations for such phenomena: given the prohibitive search space and training cost, most NAS methods favor models that converge fast at early training stages; this preference results in architectural properties associated with attack vulnerability (e.g., high loss smoothness and low gradient variance). Our findings not only reveal the relationships between model characteristics and attack vulnerability but also suggest the inherent connections underlying different attacks. Finally, we discuss potential remedies to mitigate such drawbacks, including increasing cell depth and suppressing skip connects, which lead to several promising research directions.

Keyword: Handwriting

There is no result

Keyword: Scene Text

On Exploring and Improving Robustness of Scene Text Detection Models

  • Authors: Shilian Wu, Wei Zhai, Yongrui Li, Kewei Wang, Zengfu Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.05700
  • Pdf link: https://arxiv.org/pdf/2110.05700
  • Abstract
    It is crucial to understand the robustness of text detection models with regard to extensive corruptions, since scene text detection techniques have many practical applications. For systematically exploring this problem, we propose two datasets from which to evaluate scene text detection models: ICDAR2015-C (IC15-C) and CTW1500-C (CTW-C). Our study extends the investigation of the performance and robustness of the proposed region proposal, regression and segmentation-based scene text detection frameworks. Furthermore, we perform a robustness analysis of six key components: pre-training data, backbone, feature fusion module, multi-scale predictions, representation of text instances and loss function. Finally, we present a simple yet effective data-based method to destroy the smoothness of text regions by merging background and foreground, which can significantly increase the robustness of different text detection networks. We hope that this study will provide valid data points as well as experience for future research. Benchmark, code and data will be made available at \url{https://github.com/wushilian/robust-scene-text-detection-benchmark}.

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Fri, 1 Oct 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

There is no result

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Wed, 8 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System

  • Authors: Yuning Du, Chenxia Li, Ruoyu Guo, Cheng Cui, Weiwei Liu, Jun Zhou, Bin Lu, Yehua Yang, Qiwen Liu, Xiaoguang Hu, Dianhai Yu, Yanjun Ma
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.03144
  • Pdf link: https://arxiv.org/pdf/2109.03144
  • Abstract
    Optical Character Recognition (OCR) systems have been widely used in a variety of application scenarios. Designing an OCR system is still a challenging task. In previous work, we proposed a practical ultra lightweight OCR system (PP-OCR) to balance the accuracy against the efficiency. In order to improve the accuracy of PP-OCR while keeping high efficiency, in this paper we propose a more robust OCR system, i.e. PP-OCRv2. We introduce a bag of tricks to train a better text detector and a better text recognizer, which include Collaborative Mutual Learning (CML), CopyPaste, Lightweight CPU Network (LCNet), Unified-Deep Mutual Learning (U-DML) and Enhanced CTCLoss. Experiments on real data show that the precision of PP-OCRv2 is 7% higher than PP-OCR under the same inference cost. It is also comparable to the server models of PP-OCR, which use the ResNet series as backbones. All of the above-mentioned models are open-sourced and the code is available in the GitHub repository PaddleOCR, which is powered by PaddlePaddle.
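
For readers who simply want to run the released models, the PaddleOCR Python package can be used roughly as follows (assuming `pip install paddleocr paddlepaddle`; flag names and the result structure may differ slightly between releases):

```python
from paddleocr import PaddleOCR

# Downloads default detection/classification/recognition models on first use.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

# "invoice.jpg" is a placeholder path; recent PaddleOCR releases nest results
# per page, hence the result[0] indexing below.
result = ocr.ocr("invoice.jpg", cls=True)
for box, (text, confidence) in result[0]:
    print(f"{confidence:.2f}  {text}")
```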

Keyword: Handwriting

Support Vector Machine for Handwritten Character Recognition

  • Authors: Jomy John
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.03081
  • Pdf link: https://arxiv.org/pdf/2109.03081
  • Abstract
    Handwriting recognition has been one of the most fascinating and challenging research areas in the field of image processing and pattern recognition. It contributes enormously to the improvement of the automation process. In this paper, a system for recognition of unconstrained handwritten Malayalam characters is proposed. A database of 10,000 character samples of 44 basic Malayalam characters is used in this work. A discriminative feature set of 64 local and 4 global features is used to train and test an SVM classifier, achieving 92.24% accuracy.
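
The classifier itself is standard. As a hedged sketch with scikit-learn, an SVM over a 68-dimensional feature vector (64 local + 4 global features) could be trained as below; the `features` and `labels` arrays are random stand-ins, since the Malayalam character database is not included here, so the printed accuracy is meaningless.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-ins for the paper's data: 10,000 samples x 68 features,
# 44 Malayalam character classes.
features = np.random.rand(10_000, 68)
labels = np.random.randint(0, 44, size=10_000)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels)

# RBF-kernel SVM with feature scaling; C and gamma would normally be tuned.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.4f}")
```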

Keyword: Scene Text

STRIVE: Scene Text Replacement In Videos

  • Authors: Vijay Kumar B G, Jeyasri Subramanian, Varnith Chordia, Eugene Bart, Shaobo Fang, Kelly Guan, Raja Bala
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.02762
  • Pdf link: https://arxiv.org/pdf/2109.02762
  • Abstract
    We propose replacing scene text in videos using deep style transfer and learned photometric transformations. Building on recent progress on still-image text replacement, we present extensions that alter text while preserving the appearance and motion characteristics of the original video. Compared to the problem of still-image text replacement, our method addresses additional challenges introduced by video, namely effects induced by changing lighting, motion blur, diverse variations in camera-object pose over time, and preservation of temporal consistency. We parse the problem into three steps. First, the text in all frames is normalized to a frontal pose using a spatio-temporal transformer network. Second, the text is replaced in a single reference frame using a state-of-the-art still-image text replacement method. Finally, the new text is transferred from the reference to the remaining frames using a novel learned image transformation network that captures lighting and blur effects in a temporally consistent manner. Results on synthetic and challenging real videos show realistic text transfer, competitive quantitative and qualitative performance, and superior inference speed relative to alternatives. We introduce new synthetic and real-world datasets with paired text objects. To the best of our knowledge this is the first attempt at deep video text replacement.

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Wed, 22 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

  • Authors: Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei
  • Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.10282
  • Pdf link: https://arxiv.org/pdf/2109.10282
  • Abstract
    Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. The code and models will be publicly available at https://aka.ms/TrOCR.
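
TrOCR checkpoints were subsequently published through the Hugging Face `transformers` library; assuming that library and the `microsoft/trocr-base-handwritten` checkpoint are available, line-level inference looks roughly like this:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# A single cropped text-line image; the path is a placeholder.
image = Image.open("text_line.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Autoregressive wordpiece-level decoding, as described in the abstract.
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```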

Keyword: OCR

A Data-Driven Democratized Control Architecture for Regional Transmission Operators

  • Authors: Xiaoyuan Fan, Daniel Moscovitz, Liang Du, Walid Saad
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2109.09813
  • Pdf link: https://arxiv.org/pdf/2109.09813
  • Abstract
    As probably the most complicated and critical infrastructure system, U.S. power grids become increasingly vulnerable to extreme events such as cyber-attacks and severe weather, as well as higher DER penetrations and growing information mismatch among system operators, utilities (transmission or generation owners), and end-users. This paper proposes a data-driven democratized control architecture considering two democratization pathways to assist transmission system operators, with a targeted use case of developing online proactive islanding strategies. Detailed discussions on load capability profiling at transmission buses and disaggregation of DER generations are provided and illustrated with real-world utility data. By combining network and operational constraints, transmission system operators can be equipped with new tools built on top of this architecture to derive accurate, proactive, and strategic islanding decisions that incorporate the wide range of dynamic portfolios and needs when facing extreme events or unseen grid contingencies.

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

  • Authors: Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei
  • Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.10282
  • Pdf link: https://arxiv.org/pdf/2109.10282
  • Abstract
    Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. The code and models will be publicly available at https://aka.ms/TrOCR.

Keyword: Handwriting

There is no result

Keyword: Scene Text

Oriented Object Detection in Aerial Images Based on Area Ratio of Parallelogram

  • Authors: Xinyu Yu, Mi Lin, Jiangping Lu, Linlin Ou
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2109.10187
  • Pdf link: https://arxiv.org/pdf/2109.10187
  • Abstract
    Rotated object detection is a challenging task in aerial images, as objects in aerial images are displayed in arbitrary directions and are usually densely packed. Although considerable progress has been made, existing regression-based rotation detectors still suffer from the problem of discontinuous boundaries, which is directly caused by angular periodicity or corner ordering. In this paper, we propose a simple and effective framework to address the above challenges. Instead of directly regressing the five parameters (coordinates of the central point, width, height, and rotation angle) or the four vertices, we use the area ratio of parallelogram (ARP) to accurately describe a multi-oriented object. Specifically, we regress the coordinates of the center point, the height and width of the minimum circumscribed rectangle of the oriented object, and three area ratios {\lambda}_1, {\lambda}_2 and {\lambda}_3. This may facilitate the offset learning and avoid the issue of angular periodicity or label point ordering for oriented objects. To further remedy the confusion issue for nearly horizontal objects, we employ the area ratio between the object and its horizontal bounding box (minimum circumscribed rectangle) to guide the selection of horizontal or oriented detection for each object. We also propose a rotation-efficient IoU loss (R-EIoU) to connect the horizontal bounding box with the three area ratios and improve the accuracy of the rotated bounding box. Experimental results on three remote sensing datasets, including HRSC2016, DOTA and UCAS-AOD, and a scene text dataset, ICDAR2015, show that our method achieves superior detection performance compared with many state-of-the-art approaches. The code and model will be released with the published paper.

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Mon, 13 Sep 21

Keyword: Text Detection

Artificial Text Detection via Examining the Topology of Attention Maps

  • Authors: Laida Kushnareva, Daniil Cherniavskii, Vladislav Mikhailov, Ekaterina Artemova, Serguei Barannikov, Alexander Bernstein, Irina Piontkovskaya, Dmitri Piontkovski, Evgeny Burnaev
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2109.04825
  • Pdf link: https://arxiv.org/pdf/2109.04825
  • Abstract
    The impressive capability of recent generative models to create texts that are challenging to distinguish from human-written ones can be misused for generating fake news, product reviews, and even abusive content. Despite the prominent performance of existing methods for artificial text detection, they still lack interpretability and robustness towards unseen models. To this end, we propose three novel types of interpretable topological features for this task based on Topological Data Analysis (TDA), which is currently understudied in the field of NLP. We empirically show that the features derived from the BERT model outperform count- and neural-based baselines by up to 10% on three common datasets, and tend to be the most robust towards unseen GPT-style generation models, as opposed to existing methods. The probing analysis of the features reveals their sensitivity to surface and syntactic properties. The results demonstrate that TDA is a promising direction for NLP tasks, specifically those that incorporate surface and structural information.
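
The paper derives interpretable topological features from attention maps; its exact feature set is not reproduced here. The sketch below only illustrates the general flavor: extract BERT attention maps with Hugging Face `transformers`, threshold one map into a graph, and count its connected components (the 0-th Betti number). The model checkpoint, the threshold, and the choice of layer and head are assumptions.

```python
import networkx as nx
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "A short example sentence."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions  # tuple of (batch, heads, seq, seq) per layer

def betti_0(attn: torch.Tensor, threshold: float = 0.1) -> int:
    """Number of connected components of the graph whose edges are attention
    weights above `threshold` (an illustrative stand-in for one TDA-style feature)."""
    adj = (attn > threshold).numpy().astype(int)
    graph = nx.from_numpy_array(adj)
    return nx.number_connected_components(graph)

# One feature per (layer, head); here: layer 0, head 0.
feature = betti_0(attentions[0][0, 0])
print(feature)
```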

Keyword: Text Recognition

There is no result

Keyword: OCR

FR-Detect: A Multi-Modal Framework for Early Fake News Detection on Social Media Using Publishers Features

  • Authors: Ali Jarrahi, Leila Safari
  • Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2109.04835
  • Pdf link: https://arxiv.org/pdf/2109.04835
  • Abstract
    In recent years, with the expansion of the Internet and attractive social media infrastructures, people prefer to follow the news through these media. Despite the many advantages of these media in the news field, the lack of any control and verification mechanism has led to the spread of fake news, one of the most important threats to democracy, economy, journalism and freedom of expression. Designing and using automatic methods to detect fake news on social media has become a significant challenge. In this paper, we examine the publishers' role in detecting fake news on social media. We also propose a highly accurate multi-modal framework, namely FR-Detect, that uses user-related and content-related features with early detection capability. For this purpose, two new user-related features, namely Activity Credibility and Influence, have been introduced for publishers. Furthermore, a sentence-level convolutional neural network is provided to properly combine these features with latent textual content features. Experimental results have shown that the publishers' features can improve the performance of content-based models by up to 13% and 29% in accuracy and F1-score, respectively.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Mon, 23 Aug 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling

  • Authors: Xiaopeng Lu, Zhen Fan, Yansen Wang, Jean Oh, Carolyn P. Rose
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.08965
  • Pdf link: https://arxiv.org/pdf/2108.08965
  • Abstract
    As an important task in multimodal context understanding, Text-VQA (Visual Question Answering) aims at question answering through reading text information in images. It differs from the original VQA task in that Text-VQA requires a large amount of scene-text relationship understanding, in addition to cross-modal grounding capability. In this paper, we propose Localize, Group, and Select (LOGOS), a novel model which attempts to tackle this problem from multiple aspects. LOGOS leverages two grounding tasks to better localize the key information of the image, utilizes scene text clustering to group individual OCR tokens, and learns to select the best answer from different sources of OCR (Optical Character Recognition) texts. Experiments show that LOGOS outperforms previous state-of-the-art methods on two Text-VQA benchmarks without using additional OCR annotation data. Ablation studies and analysis demonstrate the capability of LOGOS to bridge different modalities and better understand scene text.
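
LOGOS groups individual OCR tokens via scene text clustering; the abstract does not specify the clustering algorithm, so the sketch below only shows one generic way to group OCR tokens by the spatial proximity of their box centers with DBSCAN. The token list and the eps/min_samples values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Each OCR token: (text, x_center, y_center) in normalized image coordinates.
# The tokens and the eps/min_samples values below are illustrative assumptions.
tokens = [("OPEN", 0.21, 0.30), ("24", 0.23, 0.34), ("HOURS", 0.26, 0.31),
          ("EXIT", 0.80, 0.75)]

centers = np.array([[x, y] for _, x, y in tokens])
labels = DBSCAN(eps=0.08, min_samples=1).fit_predict(centers)

groups = {}
for (text, _, _), label in zip(tokens, labels):
    groups.setdefault(label, []).append(text)

# Nearby tokens end up in the same group, e.g. ["OPEN", "24", "HOURS"] vs ["EXIT"].
print(groups)
```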

Keyword: Handwriting

There is no result

Keyword: Scene Text

Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling

  • Authors: Xiaopeng Lu, Zhen Fan, Yansen Wang, Jean Oh, Carolyn P. Rose
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.08965
  • Pdf link: https://arxiv.org/pdf/2108.08965
  • Abstract
    As an important task in multimodal context understanding, Text-VQA (Visual Question Answering) aims at question answering through reading text information in images. It differs from the original VQA task in that Text-VQA requires a large amount of scene-text relationship understanding, in addition to cross-modal grounding capability. In this paper, we propose Localize, Group, and Select (LOGOS), a novel model which attempts to tackle this problem from multiple aspects. LOGOS leverages two grounding tasks to better localize the key information of the image, utilizes scene text clustering to group individual OCR tokens, and learns to select the best answer from different sources of OCR (Optical Character Recognition) texts. Experiments show that LOGOS outperforms previous state-of-the-art methods on two Text-VQA benchmarks without using additional OCR annotation data. Ablation studies and analysis demonstrate the capability of LOGOS to bridge different modalities and better understand scene text.

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Tue, 19 Oct 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

PAGnol: An Extra-Large French Generative Model

  • Authors: Julien Launay, E.L. Tommasone, Baptiste Pannier, François Boniface, Amélie Chatelain, Alessandro Cappelli, Iacopo Poli, Djamé Seddah
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2110.08554
  • Pdf link: https://arxiv.org/pdf/2110.08554
  • Abstract
    Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained to date for the French language. We plan to train increasingly large and better-performing versions of PAGnol, exploring the capabilities of French extreme-scale models. For this first release, we focus on the pre-training and scaling calculations underlying PAGnol. We fit a scaling law for compute for the French language and compare it with its English counterpart. We find the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing to other state-of-the-art French and multilingual models, and reaching the state of the art in the abstract summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large size are made publicly available.

Learning UI Navigation through Demonstrations composed of Macro Actions

  • Authors: Wei Li
  • Subjects: Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.08653
  • Pdf link: https://arxiv.org/pdf/2110.08653
  • Abstract
    We have developed a framework to reliably build agents capable of UI navigation. The state space is simplified from raw pixels to a set of UI elements extracted from screen understanding, such as OCR and icon detection. The action space is restricted to the UI elements plus a few global actions. Actions can be customized for tasks, and each action is a sequence of basic operations conditioned on status checks. With such a design, we are able to train DQfD and BC agents with a small number of demonstration episodes. We propose demo augmentation that significantly reduces the required number of human demonstrations. We customized DQfD to allow demos collected on screenshots, which facilitates demo coverage of rare cases. Demos are only collected for the failed cases during the evaluation of the previous version of the agent. With tens of iterations looping over evaluation, demo collection, and training, the agent reaches a 98.7% success rate on the search task in an environment of 80+ apps and websites where initial states and viewing parameters are randomized.

Measuring the influence of beliefs in belief networks

  • Authors: Aleksandar Tomašević
  • Subjects: Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph); Applications (stat.AP)
  • Arxiv link: https://arxiv.org/abs/2110.09154
  • Pdf link: https://arxiv.org/pdf/2110.09154
  • Abstract
    Influential beliefs are crucial for our understanding of how people reason about political issues and make political decisions. This research proposes a new method for measuring the influence of political beliefs within the larger context of belief system networks, based on advances in psychometric network methods and network influence research. Using the latest round of the European Social Survey data, we demonstrate this approach on a belief network expressing support for the regime in 29 European countries, capturing beliefs related to support for regime performance, principles, institutions, and political actors. Our results show that the average influence of beliefs can be related to the consistency and connectivity of the belief network and that the influence of specific beliefs (e.g. Satisfaction with Democracy) on a country level has a significant negative correlation with external indicators from the same domain (e.g. the Liberal Democracy index), which suggests that highly influential beliefs are related to pressing political issues. These findings suggest that network-based belief influence metrics estimated from large-scale survey data can be used as a new type of indicator in comparative political research, which opens new avenues for integrating psychometric network analysis methods into political science methodology.

Newsalyze: Effective Communication of Person-Targeting Biases in News Articles

  • Authors: Felix Hamborg, Kim Heinser, Anastasia Zhukova, Karsten Donnay, Bela Gipp
  • Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/2110.09158
  • Pdf link: https://arxiv.org/pdf/2110.09158
  • Abstract
    Media bias and its extreme form, fake news, can decisively affect public opinion. Especially when reporting on policy issues, slanted news coverage may strongly influence societal decisions, e.g., in democratic elections. Our paper makes three contributions to address this issue. First, we present a system for bias identification, which combines state-of-the-art methods from natural language understanding. Second, we devise bias-sensitive visualizations to communicate bias in news articles to non-expert news consumers. Third, our main contribution is a large-scale user study that measures bias-awareness in a setting that approximates daily news consumption, e.g., we present respondents with a news overview and individual articles. We not only measure the visualizations' effect on respondents' bias-awareness, but we can also pinpoint the effects on individual components of the visualizations by employing a conjoint design. Our bias-sensitive overviews strongly and significantly increase bias-awareness in respondents. Our study further suggests that our content-driven identification method detects groups of similarly slanted news articles due to substantial biases present in individual news articles. In contrast, the reviewed prior work rather only facilitates the visibility of biases, e.g., by distinguishing left- and right-wing outlets.

Machine Learning Featurizations for AI Hacking of Political Systems

  • Authors: Nathan E Sanders, Bruce Schneier
  • Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.09231
  • Pdf link: https://arxiv.org/pdf/2110.09231
  • Abstract
    What would the inputs be to a machine whose output is the destabilization of a robust democracy, or whose emanations could disrupt the political power of nations? In the recent essay "The Coming AI Hackers," Schneier (2021) proposed a future application of artificial intelligences to discover, manipulate, and exploit vulnerabilities of social, economic, and political systems at speeds far greater than humans' ability to recognize and respond to such threats. This work advances the concept by applying to it theory from machine learning, hypothesizing some possible "featurization" (input specification and transformation) frameworks for AI hacking. Focusing on the political domain, we develop graph and sequence data representations that would enable the application of a range of deep learning models to predict attributes and outcomes of political systems. We explore possible data models, datasets, predictive tasks, and actionable applications associated with each framework. We speculate about the likely practical impact and feasibility of such models, and conclude by discussing their ethical implications.

Towards responsible research in digital technology for health care

  • Authors: Pierre Jannin
  • Subjects: Computers and Society (cs.CY)
  • Arxiv link: https://arxiv.org/abs/2110.09255
  • Pdf link: https://arxiv.org/pdf/2110.09255
  • Abstract
    Digital technology is everywhere for the benefit of our daily and professional life. It strongly impacts our life and was crucial for maintaining professional and social activities during the COVID-19 crisis. Similarly, digital technologies are key within biomedical engineering research topics. Innovations have been generated and introduced over the last 40 years, demonstrating how computing and digital technologies have impacted health care. Although the benefits of digital technology are obvious now, we are at the convergence of several issues which make us aware of the social, societal and environmental challenges associated with this technology. In the social domain, digital technologies raise concern about exclusion (financial, geographical, educational, demographic, racial, gender, language, and disability-related exclusion) and physical and mental health. In the societal dimension, digital technologies raise concern about politics and democracy (sovereignty and governance, cognitive filters and citizens' engagement), privacy and security (data acquisition and usage transparency, level of personal approval, and level of anonymization), and economics. In the environmental dimension, digital technologies raise concern about energy consumption and hardware production. This paper introduces and defines these challenges for digital technology in general, as well as when applied to health care. The objective of this paper is to make the research community more aware of the challenges of digital technology and to promote more transparency for innovative and responsible research.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Tue, 7 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

Printed Texts Tracking and Following for a Finger-Wearable Electro-Braille System Through Opto-electrotactile Feedback

  • Authors: Mehdi Rahimi, Yantao Shen, Zhiming Liu, Fang Jiang
  • Subjects: Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2109.02385
  • Pdf link: https://arxiv.org/pdf/2109.02385
  • Abstract
    This paper presents our recent development of a portable and refreshable text reading and sensory substitution system for the blind or visually impaired (BVI), called Finger-eye. The system mainly consists of an opto-text processing unit and a compact electro-tactile based display that can deliver text-related electrical signals to the fingertip skin through a wearable and Braille-dot patterned electrode array, thus delivering electro-stimulation based Braille touch sensations to the fingertip. To achieve the goal of aiding the BVI to read any text not written in Braille through this portable system, in this work a Rapid Optical Character Recognition (R-OCR) method is first developed for real-time processing of text information based on a Fisheye imaging device mounted at the finger-wearable electro-tactile display. This allows real-time translation of printed text to electro-Braille along with natural movement of the user's fingertip, as if reading any Braille display or book. More importantly, an electro-tactile neuro-stimulation feedback mechanism is proposed and incorporated with the R-OCR method, which facilitates a new opto-electrotactile feedback based text line tracking control approach that enables text line following by the user's fingertip during reading. Multiple experiments were designed and conducted to test the ability of blindfolded participants to read through and follow the text line based on the opto-electrotactile-feedback method. The experiments show that, as a result of the opto-electrotactile feedback, the users were able to maintain their fingertip within a $2mm$ distance of the text while scanning a text line. This research is a significant step toward providing BVI users with a portable means to translate any printed text, whether digital or physical, on any surface, into Braille and to follow it while reading.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Fri, 8 Oct 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

  • Authors: Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, Zhendong Peng
  • Subjects: Sound (cs.SD); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2110.03370
  • Pdf link: https://arxiv.org/pdf/2110.03370
  • Abstract
    In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate the audio/text segmentation candidates for the YouTube data from its corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. Then we propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labelled high-quality test sets along with WenetSpeech for evaluation -- Dev for cross-validation purposes in training, Test_Net, collected from the Internet as a matched test, and Test_Meeting, recorded from real meetings as a more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is the currently largest open-sourced Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

  • Authors: Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, Zhendong Peng
  • Subjects: Sound (cs.SD); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2110.03370
  • Pdf link: https://arxiv.org/pdf/2110.03370
  • Abstract
    In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate the audio/text segmentation candidates for the YouTube data from its corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. Then we propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labelled high-quality test sets along with WenetSpeech for evaluation -- Dev for cross-validation purposes in training, Test_Net, collected from the Internet as a matched test, and Test_Meeting, recorded from real meetings as a more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is the currently largest open-sourced Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Mon, 27 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

SAIS: Supervising and Augmenting Intermediate Steps for Document-Level Relation Extraction

  • Authors: Yuxin Xiao, Zecheng Zhang, Yuning Mao, Carl Yang, Jiawei Han
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2109.12093
  • Pdf link: https://arxiv.org/pdf/2109.12093
  • Abstract
    Stepping from sentence-level to document-level relation extraction, the research community confronts increasing text length and more complicated entity interactions. Consequently, it is more challenging to encode the key sources of information--relevant contexts and entity types. However, existing methods only implicitly learn to model these critical information sources while being trained for relation extraction. As a result, they suffer the problems of ineffective supervision and uninterpretable model predictions. In contrast, we propose to explicitly teach the model to capture relevant contexts and entity types by supervising and augmenting intermediate steps (SAIS) for relation extraction. Based on a broad spectrum of carefully designed tasks, our proposed SAIS method not only extracts relations of better quality due to more effective supervision, but also retrieves the corresponding supporting evidence more accurately so as to enhance interpretability. By assessing model uncertainty, SAIS further boosts the performance via evidence-based data augmentation and ensemble inference while reducing the computational cost. Eventually, SAIS delivers state-of-the-art relation extraction results on three benchmarks (DocRED, CDR, and GDA) and achieves 5.04% relative gains in F1 score compared to the runner-up in evidence retrieval on DocRED.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Fri, 24 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

There is no result

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Mon, 6 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

There is no result

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Thu, 19 Aug 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

AdapterHub Playground: Simple and Flexible Few-Shot Learning with Adapters

  • Authors: Tilman Beck, Bela Bohlender, Christina Viehmann, Vincent Hane, Yanik Adamson, Jaber Khuri, Jonas Brossmann, Jonas Pfeiffer, Iryna Gurevych
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.08103
  • Pdf link: https://arxiv.org/pdf/2108.08103
  • Abstract
    The open-access dissemination of pretrained language models through online repositories has led to a democratization of state-of-the-art natural language processing (NLP) research. This also allows people outside of NLP to use such models and adapt them to specific use-cases. However, a certain amount of technical proficiency is still required which is an entry barrier for users who want to apply these models to a certain task but lack the necessary knowledge or resources. In this work, we aim to overcome this gap by providing a tool which allows researchers to leverage pretrained models without writing a single line of code. Built upon the parameter-efficient adapter modules for transfer learning, our AdapterHub Playground provides an intuitive interface, allowing the usage of adapters for prediction, training and analysis of textual data for a variety of NLP tasks. We present the tool's architecture and demonstrate its advantages with prototypical use-cases, where we show that predictive performance can easily be increased in a few-shot learning scenario. Finally, we evaluate its usability in a user study. We provide the code and a live interface at https://adapter-hub.github.io/playground.

End-to-End Urban Driving by Imitating a Reinforcement Learning Coach

  • Authors: Zhejun Zhang, Alexander Liniger, Dengxin Dai, Fisher Yu, Luc Van Gool
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2108.08265
  • Pdf link: https://arxiv.org/pdf/2108.08265
  • Abstract
    End-to-end approaches to autonomous driving commonly rely on expert demonstrations. Although humans are good drivers, they are not good coaches for end-to-end algorithms that demand dense on-policy supervision. On the contrary, automated experts that leverage privileged information can efficiently generate large scale on-policy and off-policy demonstrations. However, existing automated experts for urban driving make heavy use of hand-crafted rules and perform suboptimally even on driving simulators, where ground-truth information is available. To address these issues, we train a reinforcement learning expert that maps bird's-eye view images to continuous low-level actions. While setting a new performance upper-bound on CARLA, our expert is also a better coach that provides informative supervision signals for imitation learning agents to learn from. Supervised by our reinforcement learning coach, a baseline end-to-end agent with monocular camera-input achieves expert-level performance. Our end-to-end agent achieves a 78% success rate while generalizing to a new town and new weather on the NoCrash-dense benchmark and state-of-the-art performance on the more challenging CARLA LeaderBoard.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Tue, 31 Aug 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

A Pluralist Approach to Democratizing Online Discourse

  • Authors: Jay Chen, Barath Raghavan, Paul Schmitt, Tai Liu
  • Subjects: Networking and Internet Architecture (cs.NI)
  • Arxiv link: https://arxiv.org/abs/2108.12573
  • Pdf link: https://arxiv.org/pdf/2108.12573
  • Abstract
    Online discourse takes place in corporate-controlled spaces thought by users to be public realms. These platforms in name enable free speech but in practice implement varying degrees of censorship either by government edict or by uneven and unseen corporate policy. This kind of censorship has no countervailing accountability mechanism, and as such platform owners, moderators, and algorithms shape public discourse without recourse or transparency. Systems research has explored approaches to decentralizing or democratizing Internet infrastructure for decades. In parallel, the Internet censorship literature is replete with efforts to measure and overcome online censorship. However, in the course of designing specialized open-source platforms and tools, projects generally neglect the needs of supportive but uninvolved `average' users. In this paper, we propose a pluralistic approach to democratizing online discourse that considers both the systems-related and user-facing issues as first-order design goals.

A Multimodal Framework for Video Ads Understanding

  • Authors: Zejia Weng, Lingchen Meng, Rui Wang, Zuxuan Wu, Yu-Gang Jiang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2108.12868
  • Pdf link: https://arxiv.org/pdf/2108.12868
  • Abstract
    There is a growing trend of placing video advertisements on social platforms for online marketing, which demands automatic approaches to understand the contents of advertisements effectively. Taking the 2021 TAAC competition as an opportunity, we developed a multimodal system to improve the structured analysis of advertising video content. In our framework, we break down the video structuring analysis problem into two tasks, i.e., scene segmentation and multi-modal tagging. In scene segmentation, we build upon a temporal convolution module for temporal modeling to predict whether adjacent frames belong to the same scene. In multi-modal tagging, we first compute clip-level visual features by aggregating frame-level features with NeXt-SoftDBoF. The visual features are further complemented with textual features that are derived using a global-local attention mechanism to extract useful information from OCR (Optical Character Recognition) and ASR (Audio Speech Recognition) outputs. Our solution achieved a score of 0.2470 measured in consideration of localization and prediction accuracy, ranking fourth in the 2021 TAAC final leaderboard.
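
For the scene segmentation task, the system predicts whether adjacent frames belong to the same scene with a temporal convolution module. The sketch below is a minimal PyTorch rendering of that idea; the feature dimension, kernel size, and layer sizes are assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class SceneBoundaryHead(nn.Module):
    """Temporal convolution over per-frame features, predicting for every pair of
    adjacent frames the probability that they belong to the same scene.
    Sizes are illustrative assumptions, not the configuration used in the paper."""
    def __init__(self, feat_dim: int = 512, hidden: int = 256, kernel_size: int = 5):
        super().__init__()
        self.temporal = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, time, feat_dim)
        h = self.temporal(frame_feats.transpose(1, 2)).transpose(1, 2)  # (batch, time, hidden)
        pairs = torch.cat([h[:, :-1], h[:, 1:]], dim=-1)                # adjacent-frame pairs
        return torch.sigmoid(self.classifier(pairs)).squeeze(-1)        # (batch, time - 1)

scores = SceneBoundaryHead()(torch.randn(2, 30, 512))
print(scores.shape)  # torch.Size([2, 29]); a high score means "same scene"
```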

3DStyleNet: Creating 3D Shapes with Geometric and Texture Style Variations

  • Authors: Kangxue Yin, Jun Gao, Maria Shugrina, Sameh Khamis, Sanja Fidler
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
  • Arxiv link: https://arxiv.org/abs/2108.12958
  • Pdf link: https://arxiv.org/pdf/2108.12958
  • Abstract
    We propose a method to create plausible geometric and texture style variations of 3D objects in the quest to democratize 3D content creation. Given a pair of textured source and target objects, our method predicts a part-aware affine transformation field that naturally warps the source shape to imitate the overall geometric style of the target. In addition, the texture style of the target is transferred to the warped source object with the help of a multi-view differentiable renderer. Our model, 3DStyleNet, is composed of two sub-networks trained in two stages. First, the geometric style network is trained on a large set of untextured 3D shapes. Second, we jointly optimize our geometric style network and a pre-trained image style transfer network with losses defined over both the geometry and the rendering of the result. Given a small set of high-quality textured objects, our method can create many novel stylized shapes, resulting in effortless 3D content creation and style-ware data augmentation. We showcase our approach qualitatively on 3D content stylization, and provide user studies to validate the quality of our results. In addition, our method can serve as a valuable tool to create 3D data augmentations for computer vision tasks. Extensive quantitative analysis shows that 3DStyleNet outperforms alternative data augmentation techniques for the downstream task of single-image 3D reconstruction.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Tue, 28 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

There is no result

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Fri, 15 Oct 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

An algorithm for a fairer and better voting system

  • Authors: Gabriel-Claudiu Grama
  • Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
  • Arxiv link: https://arxiv.org/abs/2110.07066
  • Pdf link: https://arxiv.org/pdf/2110.07066
  • Abstract
    The major finding of this article is an ensemble method, or more precisely a novel, better ranked voting system (and other variations of it) that aims to solve the problem of finding the best candidate to represent the voters. The source code, available on GitHub, runs realistic, artificial-intelligence-based simulations of elections for comparing different variations of the algorithm with other already known algorithms. We have convincing evidence that our algorithm is better than Instant-Runoff Voting, Preferential Block Voting, Single Transferable Vote, and First Past The Post (if certain natural conditions that support the wisdom of the crowds are met). By also comparing with the best voter, we demonstrated the wisdom of the crowds, suggesting that democracy (a distributed system) is a better option than dictatorship (a centralized system) if those certain natural conditions are met. Voting systems are not restricted to politics; they are ensemble methods for artificial intelligence, but the context of this article is natural intelligence. It is important to find a system that is fair (e.g. freedom of expression on the ballot exists), especially when the outcome of the voting system has social impact: some voting systems have the unfair inevitability to trend (over time) towards the same two major candidates (Duverger's law).
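
The paper's own ranked voting algorithm is not detailed in the abstract, so it is not reproduced here. For reference, the sketch below implements one of the named baselines, standard Instant-Runoff Voting, under the assumption that every ballot is a complete ranking of the candidates.

```python
from collections import Counter

def instant_runoff(ballots: list[list[str]]) -> str:
    """Standard Instant-Runoff Voting (one of the baselines mentioned above).
    Each ballot is a complete ranking of candidates, most preferred first."""
    candidates = set(ballots[0])
    while True:
        tallies = Counter(next(c for c in ballot if c in candidates) for ballot in ballots)
        winner, votes = tallies.most_common(1)[0]
        if votes * 2 > len(ballots):              # strict majority of first preferences
            return winner
        # Eliminate the candidate with the fewest first-preference votes.
        loser = min(candidates, key=lambda c: tallies.get(c, 0))
        candidates.discard(loser)

ballots = [["A", "B", "C"], ["A", "C", "B"], ["B", "C", "A"],
           ["C", "B", "A"], ["C", "B", "A"]]
print(instant_runoff(ballots))  # "C": after "B" is eliminated, its vote transfers to C
```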

Procrastinated Tree Search: Black-box Optimization with Delayed, Noisy, and Multi-fidelity Feedback

  • Authors: Junxiong Wang, Debabrota Basu, Immanuel Trummer
  • Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2110.07232
  • Pdf link: https://arxiv.org/pdf/2110.07232
  • Abstract
    In black-box optimization problems, we aim to maximize an unknown objective function, where the function is only accessible through feedbacks of an evaluation or simulation oracle. In real-life, the feedbacks of such oracles are often noisy and available after some unknown delay that may depend on the computation time of the oracle. Additionally, if the exact evaluations are expensive but coarse approximations are available at a lower cost, the feedbacks can have multi-fidelity. In order to address this problem, we propose a generic extension of hierarchical optimistic tree search (HOO), called ProCrastinated Tree Search (PCTS), that flexibly accommodates a delay and noise-tolerant bandit algorithm. We provide a generic proof technique to quantify regret of PCTS under delayed, noisy, and multi-fidelity feedbacks. Specifically, we derive regret bounds of PCTS enabled with delayed-UCB1 (DUCB1) and delayed-UCB-V (DUCBV) algorithms. Given a horizon $T$, PCTS retains the regret bound of non-delayed HOO for expected delay of $O(\log T)$ and worsens by $O(T^{\frac{1-\alpha}{d+2}})$ for expected delays of $O(T^{1-\alpha})$ for $\alpha \in (0,1]$. We experimentally validate on multiple synthetic functions and hyperparameter tuning problems that PCTS outperforms the state-of-the-art black-box optimization methods for feedbacks with different noise levels, delays, and fidelity.

Keyword: Handwriting

Data Incubation -- Synthesizing Missing Data for Handwriting Recognition

  • Authors: Jen-Hao Rick Chang, Martin Bresler, Youssouf Chherawala, Adrien Delaye, Thomas Deselaers, Ryan Dixon, Oncel Tuzel
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2110.07040
  • Pdf link: https://arxiv.org/pdf/2110.07040
  • Abstract
    In this paper, we demonstrate how a generative model can be used to build a better recognizer through the control of content and style. We are building an online handwriting recognizer from a modest amount of training samples. By training our controllable handwriting synthesizer on the same data, we can synthesize handwriting with previously underrepresented content (e.g., URLs and email addresses) and style (e.g., cursive and slanted). Moreover, we propose a framework to analyze a recognizer that is trained with a mixture of real and synthetic training data. We use the framework to optimize data synthesis and demonstrate significant improvement on handwriting recognition over a model trained on real data only. Overall, we achieve a 66% reduction in Character Error Rate.
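
The reported 66% improvement is measured in Character Error Rate. The sketch below shows the standard CER computation (character-level Levenshtein distance divided by reference length); the example strings are made up and not data from the paper.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Standard CER: character-level Levenshtein distance divided by reference length."""
    m, n = len(reference), len(hypothesis)
    dist = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cur = min(dist[j] + 1,                                      # deletion
                      dist[j - 1] + 1,                                  # insertion
                      prev + (reference[i - 1] != hypothesis[j - 1]))   # substitution
            prev, dist[j] = dist[j], cur
    return dist[n] / max(m, 1)

# Illustrative strings, not data from the paper.
print(character_error_rate("handwriting", "handwrting"))  # one deletion -> ~0.09
```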

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

Transformer over Pre-trained Transformer for Neural Text Segmentation with Enhanced Topic Coherence

  • Authors: Kelvin Lo, Yuan Jin, Weicong Tan, Ming Liu, Lan Du, Wray Buntine
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2110.07160
  • Pdf link: https://arxiv.org/pdf/2110.07160
  • Abstract
    This paper proposes a transformer over transformer framework, called Transformer$^2$, to perform neural text segmentation. It consists of two components: bottom-level sentence encoders using pre-trained transformers, and an upper-level transformer-based segmentation model based on the sentence embeddings. The bottom-level component transfers the pre-trained knowledge learned from large external corpora under both single and pair-wise supervised NLP tasks to model the sentence embeddings for the documents. Given the sentence embeddings, the upper-level transformer is trained to recover the segmentation boundaries as well as the topic labels of each sentence. Equipped with a multi-task loss and the pre-trained knowledge, Transformer$^2$ can better capture the semantic coherence within the same segments. Our experiments show that (1) Transformer$^2$ manages to surpass state-of-the-art text segmentation models in terms of a commonly-used semantic coherence measure; (2) in most cases, both single and pair-wise pre-trained knowledge contribute to the model performance; (3) bottom-level sentence encoders pre-trained on specific languages yield better performance than those pre-trained on specific domains.
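
A minimal sketch of the two-level design described above: a pre-trained sentence encoder produces sentence embeddings, and a small Transformer over those embeddings predicts per-sentence boundary labels. The sentence-encoder checkpoint, layer sizes, and the two-class boundary head are assumptions and do not reproduce the paper's exact Transformer$^2$ architecture or its multi-task loss.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Bottom level: pre-trained sentence encoder (the checkpoint name is an assumption).
sentences = ["First topic sentence.", "More on the first topic.",
             "A new topic starts here.", "And continues."]
sent_encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = torch.tensor(sent_encoder.encode(sentences)).unsqueeze(0)  # (1, num_sents, 384)

# Upper level: a small Transformer over sentence embeddings with a boundary head.
# Layer sizes are illustrative assumptions, not the paper's configuration.
upper = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=384, nhead=8, dim_feedforward=512, batch_first=True),
    num_layers=2)
boundary_head = nn.Linear(384, 2)  # per sentence: boundary vs. non-boundary

logits = boundary_head(upper(embeddings))  # (1, num_sents, 2)
print(logits.argmax(-1))                   # predicted boundary labels (untrained here)
```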

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Tue, 24 Aug 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

  • Authors: Yuxin Wang, Hongtao Xie, Shancheng Fang, Jing Wang, Shenggao Zhu, Yongdong Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2108.09661
  • Pdf link: https://arxiv.org/pdf/2108.09661
  • Abstract
    In this paper, we abandon the dominant complex language model and rethink the linguistic learning process in scene text recognition. Different from previous methods that consider the visual and linguistic information in two separate structures, we propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union by directly endowing the vision model with language capability. Specifically, we introduce the text recognition of character-wise occluded feature maps in the training stage. This operation guides the vision model to use not only the visual texture of characters, but also the linguistic information in the visual context for recognition when the visual cues are confused (e.g. occlusion, noise, etc.). As the linguistic information is acquired along with visual features without the need of an extra language model, VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition. Furthermore, an Occlusion Scene Text (OST) dataset is proposed to evaluate the performance in the case of missing character-wise visual cues. The state-of-the-art results on several benchmarks prove our effectiveness. Code and dataset are available at https://github.com/wangyuxin87/VisionLAN.

Keyword: OCR

Self-Regulation for Semantic Segmentation

  • Authors: Zhang Dong, Zhang Hanwang, Tang Jinhui, Hua Xiansheng, Sun Qianru
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2108.09702
  • Pdf link: https://arxiv.org/pdf/2108.09702
  • Abstract
    In this paper, we seek reasons for the two major failure cases in Semantic Segmentation (SS): 1) missing small objects or minor object parts, and 2) mislabeling minor parts of large objects as wrong classes. We have an interesting finding that Failure-1 is due to the underuse of detailed features and Failure-2 is due to the underuse of visual contexts. To help the model learn a better trade-off, we introduce several Self-Regulation (SR) losses for training SS neural networks. By "self", we mean that the losses are from the model per se without using any additional data or supervision. By applying the SR losses, the deep layer features are regulated by the shallow ones to preserve more details; meanwhile, shallow layer classification logits are regulated by the deep ones to capture more semantics. We conduct extensive experiments on both weakly and fully supervised SS tasks, and the results show that our approach consistently surpasses the baselines. We also validate that SR losses are easy to implement in various state-of-the-art SS models, e.g., SPGNet and OCRNet, incurring little computational overhead during training and none for testing.
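
The abstract states that deep features are regulated by shallow ones and shallow logits by deep ones, without giving the loss definitions. The sketch below is only one generic interpretation of such mutual regularization (an MSE between features and a KL divergence between logits); it is not the paper's actual SR losses, and all tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def self_regulation_losses(shallow_feat, deep_feat, shallow_logits, deep_logits):
    """Generic mutual-regularization terms in the spirit of the abstract:
    deep features are pulled toward (detached) shallow features to preserve detail,
    and shallow logits are pulled toward (detached) deep logits to gain semantics.
    This is an interpretation for illustration, not the paper's exact formulation."""
    feat_loss = F.mse_loss(deep_feat, shallow_feat.detach())
    logit_loss = F.kl_div(F.log_softmax(shallow_logits, dim=1),
                          F.softmax(deep_logits.detach(), dim=1),
                          reduction="batchmean")
    return feat_loss, logit_loss

# Toy tensors with matching shapes (in practice the features would be resized/projected).
shallow_feat = torch.randn(2, 64, 32, 32)
deep_feat = torch.randn(2, 64, 32, 32)
shallow_logits = torch.randn(2, 21, 32, 32)
deep_logits = torch.randn(2, 21, 32, 32)
print(self_regulation_losses(shallow_feat, deep_feat, shallow_logits, deep_logits))
```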

External Knowledge Augmented Text Visual Question Answering

  • Authors: Arka Ujjal Dey, Ernest Valveny, Gaurav Harit
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2108.09717
  • Pdf link: https://arxiv.org/pdf/2108.09717
  • Abstract
    The open-ended question answering task of Text-VQA requires reading and reasoning about local, often previously unseen, scene-text content of an image to generate answers. In this work, we propose the generalized use of external knowledge to augment our understanding of the said scene-text. We design a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision language understanding tasks. Through empirical evidence, we demonstrate how knowledge can highlight instance-only cues and thus help deal with training data bias, improve answer entity type correctness, and detect multiword named entities. We generate results comparable to the state-of-the-art on two publicly available datasets, under the constraints of similar upstream OCR systems and training data.

Keyword: Handwriting

There is no result

Keyword: Scene Text

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

  • Authors: Yuxin Wang, Hongtao Xie, Shancheng Fang, Jing Wang, Shenggao Zhu, Yongdong Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2108.09661
  • Pdf link: https://arxiv.org/pdf/2108.09661
  • Abstract
    In this paper, we abandon the dominant complex language model and rethink the linguistic learning process in scene text recognition. Different from previous methods that consider the visual and linguistic information in two separate structures, we propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union by directly endowing the vision model with language capability. Specifically, we introduce the text recognition of character-wise occluded feature maps in the training stage. This operation guides the vision model to use not only the visual texture of characters, but also the linguistic information in the visual context for recognition when the visual cues are confused (e.g. occlusion, noise, etc.). As the linguistic information is acquired along with visual features without the need of an extra language model, VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition. Furthermore, an Occlusion Scene Text (OST) dataset is proposed to evaluate the performance in the case of missing character-wise visual cues. The state-of-the-art results on several benchmarks prove our effectiveness. Code and dataset are available at https://github.com/wangyuxin87/VisionLAN.

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Wed, 15 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

Post-OCR Document Correction with large Ensembles of Character Sequence Models

  • Authors: Juan Ramirez-Orta, Eduardo Xamena, Ana Maguitman, Evangelos Milios, Axel J. Soto
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2109.06264
  • Pdf link: https://arxiv.org/pdf/2109.06264
  • Abstract
    In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies to accurately process strings much longer than the ones used to train the sequence model while being sample- and resource-efficient, supported by thorough experimentation. The strategy with the best performance involves splitting the input document into character n-grams and combining their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contributions from each member of this ensemble. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve a new state-of-the-art performance in five of them. Our code for post-OCR correction is shared at https://github.com/jarobyte91/post_ocr_correction.
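
The best-performing strategy splits the document into overlapping character n-grams, corrects each window with the sequence model, and combines the windows by per-character voting. The sketch below shows only that splitting-and-voting skeleton: `correct_window` is a toy placeholder for the trained character sequence model, the window size is an assumption, and the sketch additionally assumes each corrected window keeps its input length, which the real model need not do.

```python
from collections import Counter

def correct_window(window: str) -> str:
    """Placeholder for the trained character sequence-to-sequence corrector.
    Here it only fixes two toy OCR confusions so the sketch stays runnable."""
    return window.replace("0", "o").replace("1", "l")

def ngram_vote_correct(text: str, n: int = 8) -> str:
    """Split the text into overlapping character n-grams, correct each window,
    and pick every output character by majority vote across the windows covering it.
    Assumes (for simplicity) that each corrected window keeps its input length."""
    votes = [Counter() for _ in text]
    for start in range(0, max(len(text) - n + 1, 1)):
        corrected = correct_window(text[start:start + n])
        for offset, ch in enumerate(corrected[:n]):
            if start + offset < len(votes):
                votes[start + offset][ch] += 1
    return "".join(c.most_common(1)[0][0] for c in votes)

print(ngram_vote_correct("the m0de1 was scanned"))  # -> "the model was scanned"
```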

Optimal To-Do List Gamification for Long Term Planning

  • Authors: Saksham Consul, Jugoslav Stojcheski, Valkyrie Felso, Falk Lieder
  • Subjects: Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2109.06505
  • Pdf link: https://arxiv.org/pdf/2109.06505
  • Abstract
    Most people struggle with prioritizing work. While inexact heuristics have been developed over time, there is still no tractable principled algorithm for deciding which of the many possible tasks one should tackle in any given day, month, week, or year. Additionally, some people suffer from cognitive biases such as the present bias, leading to prioritization of their immediate experience over long-term consequences, which manifests itself as procrastination and inefficient task prioritization. Our method utilizes optimal gamification to help people overcome these problems by incentivizing each task with a number of points that convey how valuable it is in the long run. We extend the previous version of our optimal gamification method with added services for helping people decide which tasks should and should not be done when there is not enough time to do everything. To improve the efficiency and scalability of the to-do list solver, we designed a hierarchical procedure that tackles the problem from the top-level goals down to fine-grained tasks. We test the accuracy of the incentivized to-do list by comparing the performance of the strategy with the points computed exactly using Value Iteration for a variety of case studies. These case studies were specifically designed to cover the corner cases to give an accurate assessment of performance. Our method yielded the same performance as the exact method for all case studies. To demonstrate its functionality, we released an API that makes it easy to deploy our method in Web and app services. We assessed the scalability of our method by applying it to to-do lists with increasingly larger numbers of goals, sub-goals per goal, and hierarchically nested levels of subgoals. We found that the method provided through our API is able to tackle fairly large to-do lists with 576 tasks. This indicates that our method is suitable for real-world applications.

Learning Bill Similarity with Annotated and Augmented Corpora of Bills

  • Authors: Jiseon Kim, Elden Griggs, In Song Kim, Alice Oh
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2109.06527
  • Pdf link: https://arxiv.org/pdf/2109.06527
  • Abstract
    Bill writing is a critical element of representative democracy. However, it is often overlooked that most legislative bills are derived, or even directly copied, from other bills. Despite the significance of bill-to-bill linkages for understanding the legislative process, existing approaches fail to address semantic similarities across bills, let alone reordering or paraphrasing which are prevalent in legal document writing. In this paper, we overcome these limitations by proposing a 5-class classification task that closely reflects the nature of the bill generation process. In doing so, we construct a human-labeled dataset of 4,721 bill-to-bill relationships at the subsection-level and release this annotated dataset to the research community. To augment the dataset, we generate synthetic data with varying degrees of similarity, mimicking the complex bill writing process. We use BERT variants and apply multi-stage training, sequentially fine-tuning our models with synthetic and human-labeled datasets. We find that the predictive performance significantly improves when training with both human-labeled and synthetic data. Finally, we apply our trained model to infer section- and bill-level similarities. Our analysis shows that the proposed methodology successfully captures the similarities across legal documents at various levels of aggregation.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

Learning Constraints and Descriptive Segmentation for Subevent Detection

  • Authors: Haoyu Wang, Hongming Zhang, Muhao Chen, Dan Roth
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2109.06316
  • Pdf link: https://arxiv.org/pdf/2109.06316
  • Abstract
    Event mentions in text correspond to real-world events of varying degrees of granularity. The task of subevent detection aims to resolve this granularity issue, recognizing the membership of multi-granular events in event complexes. Since knowing the span of descriptive contexts of event complexes helps infer the membership of events, we propose the task of event-based text segmentation (EventSeg) as an auxiliary task to improve the learning for subevent detection. To bridge the two tasks together, we propose an approach to learning and enforcing constraints that capture dependencies between subevent detection and EventSeg prediction, as well as guiding the model to make globally consistent inference. Specifically, we adopt Rectifier Networks for constraint learning and then convert the learned constraints to a regularization term in the loss function of the neural model. Experimental results show that the proposed method outperforms baseline methods by 2.3% and 2.5% on benchmark datasets for subevent detection, HiEve and IC, respectively, while achieving a decent performance on EventSeg prediction.

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Mon, 18 Oct 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

Making Document-Level Information Extraction Right for the Right Reasons

  • Authors: Liyan Tang, Dhruv Rajan, Suyash Mohan, Abhijeet Pradhan, R. Nick Bryan, Greg Durrett
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2110.07686
  • Pdf link: https://arxiv.org/pdf/2110.07686
  • Abstract
    Document-level information extraction is a flexible framework compatible with applications where information is not necessarily localized in a single sentence. For example, key features of a diagnosis in a radiology report may not be explicitly stated, but nevertheless can be inferred from the report's text. However, document-level neural models can easily learn spurious correlations from irrelevant information. This work studies how to ensure that these models make correct inferences from complex text and make those inferences in an auditable way: beyond just being right, are these models "right for the right reasons?" We experiment with post-hoc evidence extraction in a predict-select-verify framework using feature attribution techniques. While this basic approach can extract reasonable evidence, it can be regularized with small amounts of evidence supervision during training, which substantially improves the quality of extracted evidence. We evaluate on two domains: a small-scale labeled dataset of brain MRI reports and a large-scale modified version of DocRED (Yao et al., 2019) and show that models' plausibility can be improved with no loss in accuracy.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Tue, 21 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

Atrial Fibrillation: A Medical and Technological Review

  • Authors: Samayan Bhattacharya, Sk Shahnawaz
  • Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2109.08974
  • Pdf link: https://arxiv.org/pdf/2109.08974
  • Abstract
    Atrial Fibrillation (AF) is the most common type of arrhythmia (Greek a-, loss + rhythmos, rhythm = loss of rhythm) leading to hospitalization in the United States. Though sometimes AF is asymptomatic, it increases the risk of stroke and heart failure in patients, in addition to lowering the health-related quality of life (HRQOL). AF-related care costs the healthcare system between $6.0 billion and $26 billion each year. Early detection of AF and clinical attention can help improve symptoms and HRQOL of the patient, as well as bring down the cost of care. However, the prevalent paradigm of AF detection depends on an electrocardiogram (ECG) recorded at a single point in time and does not shed light on the relation of the symptoms with heart rhythm or AF. In the past decade, due to the democratization of health monitors and the advent of high-performing computers, Machine Learning algorithms have proven effective in identifying AF from the ECG of patients. This paper provides an overview of the symptoms of AF, its diagnosis, and future prospects for research in the field.

Keyword: Handwriting

Multiscale Manifold Warping

  • Authors: Sridhar Mahadevan, Anup Rao, Georgios Theocharous, Jennifer Healey
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2109.09222
  • Pdf link: https://arxiv.org/pdf/2109.09222
  • Abstract
    Many real-world applications require aligning two temporal sequences, including bioinformatics, handwriting recognition, activity recognition, and human-robot coordination. Dynamic Time Warping (DTW) is a popular alignment method, but can fail on high-dimensional real-world data where the dimensions of aligned sequences are often unequal. In this paper, we show that exploiting the multiscale manifold latent structure of real-world data can yield improved alignment. We introduce a novel framework called Warping on Wavelets (WOW) that integrates DTW with a multi-scale manifold learning framework called Diffusion Wavelets. We present a theoretical analysis of the WOW family of algorithms and show that it outperforms previous state-of-the-art methods, such as canonical time warping (CTW) and manifold warping, on several real-world datasets.
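
For readers unfamiliar with the baseline being extended, the following is a standard dynamic time warping implementation, the classical building block that WOW builds on. It is a generic sketch, not the authors' code; WOW would replace the raw features with multiscale diffusion-wavelet embeddings before alignment.

```python
import numpy as np

def dtw_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Classical dynamic time warping between two sequences of feature vectors.

    x: (n, d) array, y: (m, d) array. Returns the cumulative alignment cost.
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # local distance
            # Extend the cheapest of the three admissible alignment moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# Example: two sequences with different lengths but similar shape align cheaply.
a = np.sin(np.linspace(0, 3, 40)).reshape(-1, 1)
b = np.sin(np.linspace(0, 3, 55)).reshape(-1, 1)
print(dtw_distance(a, b))
```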

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Mon, 11 Oct 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

KOHTD: Kazakh Offline Handwritten Text Dataset

  • Authors: Nazgul Toiganbayeva, Mahmoud Kasem, Galymzhan Abdimanap, Kairat Bostanbekov, Abdelrahman Abdallah, Anel Alimova, Daniyar Nurseitov
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.04075
  • Pdf link: https://arxiv.org/pdf/2110.04075
  • Abstract
    Despite the transition to digital information exchange, many documents, such as invoices, taxes, memos and questionnaires, historical data, and answers to exam questions, still require handwritten inputs. In this regard, there is a need to implement Handwritten Text Recognition (HTR), an automatic way to decipher handwritten records using a computer. Handwriting recognition is challenging because of the virtually infinite number of ways a person can write the same message. To advance Kazakh handwritten text recognition research, a comprehensive dataset of Kazakh handwritten texts is necessary; this is particularly true given the lack of a dataset for handwritten Kazakh text. In this paper, we propose our extensive Kazakh Offline Handwritten Text Dataset (KOHTD), which comprises 3000 handwritten exam papers, more than 140335 segmented images, and approximately 922010 symbols. It can serve researchers working on handwriting recognition tasks with deep learning and machine learning. We used a variety of popular text recognition methods for word and line recognition in our studies, including CTC-based and attention-based methods. The findings demonstrate KOHTD's diversity. We also propose a Genetic Algorithm (GA) for line and word segmentation based on random enumeration of a parameter. The dataset and GA code are available at https://github.com/abdoelsayed2016/KOHTD.
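
As a rough illustration of the CTC-based line recognition setup mentioned above, here is a minimal PyTorch sketch; the character set size, the toy encoder, and all shapes are assumptions, not the KOHTD baselines.

```python
import torch
import torch.nn as nn

# Minimal sketch of a CTC-based line-recognition training step (illustrative only).
num_classes = 80          # assumed Kazakh character set size + 1 blank symbol
batch, width, feat = 4, 120, 64

# Toy encoder: per-timestep features -> class logits (a real model would be a CNN+RNN).
encoder = nn.Linear(feat, num_classes)
features = torch.randn(batch, width, feat)
log_probs = encoder(features).log_softmax(-1).permute(1, 0, 2)  # (T, N, C) for CTCLoss

targets = torch.randint(1, num_classes, (batch, 15))            # label indices, 0 = blank
input_lengths = torch.full((batch,), width, dtype=torch.long)
target_lengths = torch.full((batch,), 15, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```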

Keyword: OCR

There is no result

Keyword: Handwriting

KOHTD: Kazakh Offline Handwritten Text Dataset

  • Authors: Nazgul Toiganbayeva, Mahmoud Kasem, Galymzhan Abdimanap, Kairat Bostanbekov, Abdelrahman Abdallah, Anel Alimova, Daniyar Nurseitov
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.04075
  • Pdf link: https://arxiv.org/pdf/2110.04075
  • Abstract
    Despite the transition to digital information exchange, many documents, such as invoices, taxes, memos and questionnaires, historical data, and answers to exam questions, still require handwritten inputs. In this regard, there is a need to implement Handwritten Text Recognition (HTR), an automatic way to decipher handwritten records using a computer. Handwriting recognition is challenging because of the virtually infinite number of ways a person can write the same message. To advance Kazakh handwritten text recognition research, a comprehensive dataset of Kazakh handwritten texts is necessary; this is particularly true given the lack of a dataset for handwritten Kazakh text. In this paper, we propose our extensive Kazakh Offline Handwritten Text Dataset (KOHTD), which comprises 3000 handwritten exam papers, more than 140335 segmented images, and approximately 922010 symbols. It can serve researchers working on handwriting recognition tasks with deep learning and machine learning. We used a variety of popular text recognition methods for word and line recognition in our studies, including CTC-based and attention-based methods. The findings demonstrate KOHTD's diversity. We also propose a Genetic Algorithm (GA) for line and word segmentation based on random enumeration of a parameter. The dataset and GA code are available at https://github.com/abdoelsayed2016/KOHTD.

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Thu, 19 Aug 21

Keyword: Text Detection

There is no result

Keyword: Text

Higher-Order Concurrency for Microcontrollers

  • Authors: Abhiroop Sarkar, Robert Krook, Bo Joel Svensson, Mary Sheeran
  • Subjects: Programming Languages (cs.PL); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.07805
  • Pdf link: https://arxiv.org/pdf/2108.07805
  • Abstract
    Programming microcontrollers involves low-level interfacing with hardware and peripherals that are concurrent and reactive. Such programs are typically written in a mixture of C and assembly using concurrent language extensions (like $\texttt{FreeRTOS tasks}$ and $\texttt{semaphores}$), resulting in unsafe, callback-driven, error-prone and difficult-to-maintain code. We address this challenge by introducing $\texttt{SenseVM}$ - a bytecode-interpreted virtual machine that provides a message-passing based $\textit{higher-order concurrency}$ model, originally introduced by Reppy, for microcontroller programming. This model treats synchronous operations as first-class values (called $\texttt{Events}$) akin to the treatment of first-class functions in functional languages. This primarily allows the programmer to compose and tailor their own concurrency abstractions and, additionally, abstracts away unsafe memory operations, common in shared-memory concurrency models, thereby making microcontroller programs safer, composable and easier-to-maintain. Our VM is made portable via a low-level $\textit{bridge}$ interface, built atop the embedded OS - Zephyr. The bridge is implemented by all drivers and designed such that programming in response to a software message or a hardware interrupt remains uniform and indistinguishable. In this paper we demonstrate the features of our VM through an example, written in a Caml-like functional language, running on the $\texttt{nRF52840}$ and $\texttt{STM32F4}$ microcontrollers.

ARCH++: Animation-Ready Clothed Human Reconstruction Revisited

  • Authors: Tong He, Yuanlu Xu, Shunsuke Saito, Stefano Soatto, Tony Tung
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
  • Arxiv link: https://arxiv.org/abs/2108.07845
  • Pdf link: https://arxiv.org/pdf/2108.07845
  • Abstract
    We present ARCH++, an image-based method to reconstruct 3D avatars with arbitrary clothing styles. Our reconstructed avatars are animation-ready and highly realistic, in both the visible regions from input views and the unseen regions. While prior work shows great promise of reconstructing animatable clothed humans with various topologies, we observe that there exist fundamental limitations resulting in sub-optimal reconstruction quality. In this paper, we revisit the major steps of image-based avatar reconstruction and address the limitations with ARCH++. First, we introduce an end-to-end point based geometry encoder to better describe the semantics of the underlying 3D human body, in replacement of previous hand-crafted features. Second, in order to address the occupancy ambiguity caused by topological changes of clothed humans in the canonical pose, we propose a co-supervising framework with cross-space consistency to jointly estimate the occupancy in both the posed and canonical spaces. Last, we use image-to-image translation networks to further refine detailed geometry and texture on the reconstructed surface, which improves the fidelity and consistency across arbitrary viewpoints. In the experiments, we demonstrate improvements over the state of the art on both public benchmarks and user studies in reconstruction quality and realism.

Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net

  • Authors: Yu Qiu, Yun Liu, Le Zhang, Jing Xu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2108.07851
  • Pdf link: https://arxiv.org/pdf/2108.07851
  • Abstract
    Existing salient object detection (SOD) models mainly rely on CNN-based U-shaped structures with skip connections to combine the global contexts and local spatial details that are crucial for locating salient objects and refining object details, respectively. Despite great successes, the ability of CNNs to learn global contexts is limited. Recently, the vision transformer has achieved revolutionary progress in computer vision owing to its powerful modeling of global dependencies. However, directly applying the transformer to SOD is obviously suboptimal because the transformer lacks the ability to learn local spatial representations. To this end, this paper explores the combination of transformer and CNN to learn both global and local representations for SOD. We propose a transformer-based Asymmetric Bilateral U-Net (ABiU-Net). The asymmetric bilateral encoder has a transformer path and a lightweight CNN path, where the two paths communicate at each encoder stage to learn complementary global contexts and local spatial details, respectively. The asymmetric bilateral decoder also consists of two paths to process features from the transformer and CNN encoder paths, with communication at each decoder stage for decoding coarse salient object locations and fine-grained object details, respectively. Such communication between the two encoder/decoder paths enables ABiU-Net to learn complementary global and local representations, taking advantage of the natural properties of transformer and CNN, respectively. Hence, ABiU-Net provides a new perspective for transformer-based SOD. Extensive experiments demonstrate that ABiU-Net performs favorably against previous state-of-the-art SOD methods. The code will be released.

Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021): Workshop and Shared Task Report

  • Authors: Ali Hürriyetoğlu, Hristo Tanev, Vanni Zavarella, Jakub Piskorski, Reyyan Yeniterzi, Erdem Yörük
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.07865
  • Pdf link: https://arxiv.org/pdf/2108.07865
  • Abstract
    This workshop is the fourth issue of a series of workshops on automatic extraction of socio-political events from news, organized by the Emerging Market Welfare Project, with the support of the Joint Research Centre of the European Commission and with contributions from many other prominent scholars in this field. The purpose of this series of workshops is to foster research and development of reliable, valid, robust, and practical solutions for automatically detecting descriptions of socio-political events, such as protests, riots, wars and armed conflicts, in text streams. This year's workshop contributors make use of state-of-the-art NLP technologies, such as Deep Learning, Word Embeddings and Transformers, and cover a wide range of topics from text classification to news bias detection. Around 40 teams registered and 15 teams contributed to three tasks: i) multilingual protest news detection, ii) fine-grained classification of socio-political events, and iii) discovering Black Lives Matter protest events. The workshop also highlights two keynote and four invited talks about various aspects of creating event data sets and multi- and cross-lingual machine learning in few- and zero-shot settings.

Contextualizing Variation in Text Style Transfer Datasets

  • Authors: Stephanie Schoch, Wanyu Du, Yangfeng Ji
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.07871
  • Pdf link: https://arxiv.org/pdf/2108.07871
  • Abstract
    Text style transfer involves rewriting the content of a source sentence in a target style. Despite there being a number of style tasks with available data, there has been limited systematic discussion of how text style datasets relate to each other. This understanding, however, is likely to have implications for selecting multiple data sources for model training. While it is prudent to consider inherent stylistic properties when determining these relationships, we also must consider how a style is realized in a particular dataset. In this paper, we conduct several empirical analyses of existing text style datasets. Based on our results, we propose a categorization of stylistic and dataset properties to consider when utilizing or comparing text style datasets.

Modulating Language Models with Emotions

  • Authors: Ruibo Liu, Jason Wei, Chenyan Jia, Soroush Vosoughi
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2108.07886
  • Pdf link: https://arxiv.org/pdf/2108.07886
  • Abstract
    Generating context-aware language that embodies diverse emotions is an important step towards building empathetic NLP systems. In this paper, we propose a formulation of modulated layer normalization -- a technique inspired by computer vision -- that allows us to use large-scale language models for emotional response generation. In automatic and human evaluation on the MojiTalk dataset, our proposed modulated layer normalization method outperforms prior baseline methods while maintaining diversity, fluency, and coherence. Our method also obtains competitive performance even when using only 10% of the available training data.

On the Virality of Animated GIFs on Tumblr

  • Authors: Yunseok Jang, Yale Song, Gunhee Kim
  • Subjects: Social and Information Networks (cs.SI)
  • Arxiv link: https://arxiv.org/abs/2108.07894
  • Pdf link: https://arxiv.org/pdf/2108.07894
  • Abstract
    Animated GIFs are becoming increasingly popular in online communication. People use them to express emotion, share their interests and enhance (or even replace) short-form texting; they are a new means to tell visual stories. Some creative animated GIFs are highly addictive to watch, and eventually become viral -- they circulate rapidly and widely within the network. What makes certain animated GIFs go viral? In this paper, we study the virality of animated GIFs by analyzing over 10 months of complete data logs (more than 1B posts and 12B reblogs) on Tumblr, one of the largest repositories of animated GIFs on the Internet. We conduct a series of quantitative and comparative studies on Tumblr data, comparing major types of online content -- text, images, videos, and animated GIFs. We report on a number of interesting, new findings on animated GIFs. We show that people tend to make animated GIFs easily searchable and discoverable by adding more hashtags than other content types. We also show that animated GIFs tend to go more viral than images and videos on Tumblr. With more in-depth analysis, we show that animated GIFs tend to get reblogged more and followed more by non-followers, and that animated GIF posts show more recurrence. Lastly, we show that the virality of animated GIFs is more easily predictable than that of images and videos.

Affect-Aware Deep Belief Network Representations for Multimodal Unsupervised Deception Detection

  • Authors: Leena Mathur, Maja J Matarić
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2108.07897
  • Pdf link: https://arxiv.org/pdf/2108.07897
  • Abstract
    Automated systems that detect the social behavior of deception can enhance human well-being across medical, social work, and legal domains. Labeled datasets to train supervised deception detection models can rarely be collected for real-world, high-stakes contexts. To address this challenge, we propose the first unsupervised approach for detecting real-world, high-stakes deception in videos without requiring labels. This paper presents our novel approach for affect-aware unsupervised Deep Belief Networks (DBN) to learn discriminative representations of deceptive and truthful behavior. Drawing on psychology theories that link affect and deception, we experimented with unimodal and multimodal DBN-based approaches trained on facial valence, facial arousal, audio, and visual features. In addition to using facial affect as a feature on which DBN models are trained, we also introduce a DBN training procedure that uses facial affect as an aligner of audio-visual representations. We conducted classification experiments with unsupervised Gaussian Mixture Model clustering to evaluate our approaches. Our best unsupervised approach (trained on facial valence and visual features) achieved an AUC of 80%, outperforming human ability and performing comparably to fully-supervised models. Our results motivate future work on unsupervised, affect-aware computational approaches for detecting deception and other social behaviors in the wild.

Assessing the Integration of Software Agents and Industrial Automation Systems with ISO/IEC 25010

  • Authors: Stamatis Karnouskos, Roopak Sinha, Paulo Leitão, Luis Ribeiro, Thomas I. Strasser
  • Subjects: Software Engineering (cs.SE); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2108.07933
  • Pdf link: https://arxiv.org/pdf/2108.07933
  • Abstract
    Agent-technologies have been used for higher-level decision making in addition to carrying out lower-level automation and control functions in industrial systems. Recent research has identified a number of architectural patterns for the use of agents in industrial automation systems but these practices vary in several ways, including how closely agents are coupled with physical systems and their control functions. Such practices may play a pivotal role in the Cyber-Physical System integration and interaction. Hence, there is a clear need for a common set of criteria for assessing available practices and identifying a best-fit practice for a given industrial use case. Unfortunately, no such common criteria exist currently. This work proposes an assessment criteria approach as well as a methodology to enable the use case based selection of a best practice for integrating agents and industrial systems. The software product quality model proposed by the ISO/IEC 25010 family of standards is used as starting point and is put in the industrial automation context. Subsequently, the proposed methodology is applied, and a survey of experts in the domain is carried out, in order to reveal some insights on the key characteristics of the subject matter.

Learning Implicit User Profiles for Personalized Retrieval-Based Chatbot

  • Authors: Hongjin Qian, Zhicheng Dou, Yutao Zhu, Yueyuan Ma, Ji-Rong Wen
  • Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.07935
  • Pdf link: https://arxiv.org/pdf/2108.07935
  • Abstract
    In this paper, we explore the problem of developing personalized chatbots. A personalized chatbot is designed as a digital chatting assistant for a user. The key characteristic of a personalized chatbot is that it should have a consistent personality with the corresponding user. It can talk the same way as the user when it is delegated to respond to others' messages. We present a retrieval-based personalized chatbot model, namely IMPChat, to learn an implicit user profile from the user's dialogue history. We argue that the implicit user profile is superior to the explicit user profile regarding accessibility and flexibility. IMPChat aims to learn an implicit user profile through modeling user's personalized language style and personalized preferences separately. To learn a user's personalized language style, we elaborately build language models from shallow to deep using the user's historical responses; To model a user's personalized preferences, we explore the conditional relations underneath each post-response pair of the user. The personalized preferences are dynamic and context-aware: we assign higher weights to those historical pairs that are topically related to the current query when aggregating the personalized preferences. We match each response candidate with the personalized language style and personalized preference, respectively, and fuse the two matching signals to determine the final ranking score. Comprehensive experiments on two large datasets show that our method outperforms all baseline models.

FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning

  • Authors: Chenxu Zhang, Yifan Zhao, Yifei Huang, Ming Zeng, Saifeng Ni, Madhukar Budagavi, Xiaohu Guo
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2108.07938
  • Pdf link: https://arxiv.org/pdf/2108.07938
  • Abstract
    In this paper, we propose a talking face generation method that takes an audio signal as input and a short target video clip as reference, and synthesizes a photo-realistic video of the target face with natural lip motions, head poses, and eye blinks that are in-sync with the input audio signal. We note that the synthetic face attributes include not only explicit ones such as lip motions that have high correlations with speech, but also implicit ones such as head poses and eye blinks that have only weak correlation with the input audio. To model such complicated relationships among different face attributes with input audio, we propose a FACe Implicit Attribute Learning Generative Adversarial Network (FACIAL-GAN), which integrates the phonetics-aware, context-aware, and identity-aware information to synthesize the 3D face animation with realistic motions of lips, head poses, and eye blinks. Then, our Rendering-to-Video network takes the rendered face images and the attention map of eye blinks as input to generate the photo-realistic output video frames. Experimental results and user studies show our method can generate realistic talking face videos with not only synchronized lip motions, but also natural head movements and eye blinks, with better qualities than the results of state-of-the-art methods.

Object Disparity

  • Authors: Ynjiun Paul Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2108.07939
  • Pdf link: https://arxiv.org/pdf/2108.07939
  • Abstract
    Most stereo vision works focus on computing the dense pixel disparity of a given pair of left and right images. A camera pair usually requires lens undistortion and stereo calibration to provide an undistorted, epipolar-line-calibrated image pair for accurate dense pixel disparity computation. Due to noise, object occlusion, repetitive or missing texture, and limitations of matching algorithms, pixel disparity accuracy usually suffers the most at object boundary areas. Although statistically the total number of pixel disparity errors might be low (under 2% according to the Kitti Vision Benchmark for current top-ranking algorithms), the percentage of these disparity errors at object boundaries is very high. This renders the subsequent 3D object distance detection much less accurate than desired. This paper proposes a different approach to 3D object distance detection by detecting object disparity directly, without going through a dense pixel disparity computation. An example squeezenet Object Disparity-SSD (OD-SSD) was constructed to demonstrate efficient object disparity detection with accuracy comparable to the Kitti dataset pixel disparity ground truth. Further training and testing results with a mixed image dataset captured by several different stereo systems suggest that OD-SSD might be agnostic to stereo system parameters such as baseline, FOV, lens distortion, and even left/right camera epipolar line misalignment.

Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

  • Authors: Yucheng Zhao, Guangting Wang, Chong Luo, Wenjun Zeng, Zheng-Jun Zha
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2108.07954
  • Pdf link: https://arxiv.org/pdf/2108.07954
  • Abstract
    Advanced self-supervised visual representation learning methods rely on the instance discrimination (ID) pretext task. We point out that the ID task has an implicit semantic consistency (SC) assumption, which may not hold in unconstrained datasets. In this paper, we propose a novel contrastive mask prediction (CMP) task for visual representation learning and design a mask contrast (MaskCo) framework to implement the idea. MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions. To solve the domain gap between masked and unmasked features, we design a dedicated mask prediction head in MaskCo. This module is shown to be the key to the success of the CMP. We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2. Results show that MaskCo achieves comparable performance with MoCo V2 using the ImageNet training dataset, but demonstrates a stronger performance across a range of downstream tasks when COCO or Conceptual Captions are used for training. MaskCo provides a promising alternative to the ID-based methods for self-supervised learning in the wild.
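
The region-level contrastive objective that frameworks like MaskCo optimize is, at its core, an InfoNCE loss. The sketch below shows that generic loss with hypothetical tensor shapes; it should not be read as the MaskCo implementation itself.

```python
import torch
import torch.nn.functional as F

def info_nce(query: torch.Tensor, positive: torch.Tensor,
             negatives: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """Generic InfoNCE loss over region-level features (a sketch, not MaskCo itself).

    query:     (B, D) predicted features for the masked regions
    positive:  (B, D) features of the same regions from the unmasked view
    negatives: (B, K, D) features of other regions used as negatives
    """
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logit = (query * positive).sum(-1, keepdim=True)          # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", query, negatives)     # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(len(query), dtype=torch.long)            # positive sits at index 0
    return F.cross_entropy(logits, labels)
```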

De-identification of Unstructured Clinical Texts from Sequence to Sequence Perspective

  • Authors: Md Monowar Anjum, Noman Mohammed, Xiaoqian Jiang
  • Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2108.07971
  • Pdf link: https://arxiv.org/pdf/2108.07971
  • Abstract
    In this work, we propose a novel problem formulation for de-identification of unstructured clinical text. We formulate the de-identification problem as a sequence to sequence learning problem instead of a token classification problem. Our approach is inspired by the recent state-of-the-art performance of sequence to sequence learning models for named entity recognition. Early experimentation with our proposed approach achieved a 98.91% recall rate on the i2b2 dataset. This performance is comparable to current state-of-the-art models for unstructured clinical text de-identification.
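
One way to picture the sequence-to-sequence reformulation described above is to rewrite protected-health-information spans as category tags in the target sequence. The snippet below is an illustrative data-preparation sketch with an invented BIO labeling scheme, not the paper's pipeline.

```python
from typing import List, Tuple

# Illustrative only: recast token-level de-identification labels as a
# sequence-to-sequence pair, where PHI spans become category tags in the target.
def to_seq2seq_example(tokens: List[str], labels: List[str]) -> Tuple[str, str]:
    source = " ".join(tokens)
    out: List[str] = []
    for token, label in zip(tokens, labels):
        # Labels follow a BIO-style scheme, e.g. "B-NAME", "I-NAME", "O".
        if label == "O":
            out.append(token)
        elif label.startswith("B-"):
            out.append(f"[{label[2:]}]")
        # "I-" tokens are absorbed into the preceding tag.
    return source, " ".join(out)

src, tgt = to_seq2seq_example(
    ["Patient", "John", "Smith", "seen", "on", "01/02/2020", "."],
    ["O", "B-NAME", "I-NAME", "O", "O", "B-DATE", "O"],
)
print(src)  # Patient John Smith seen on 01/02/2020 .
print(tgt)  # Patient [NAME] seen on [DATE] .
```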

EviDR: Evidence-Emphasized Discrete Reasoning for Reasoning Machine Reading Comprehension

  • Authors: Yongwei Zhou, Junwei Bao, Haipeng Sun, Jiahui Liang, Youzheng Wu, Xiaodong He, Bowen Zhou, Tiejun Zhao
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.07994
  • Pdf link: https://arxiv.org/pdf/2108.07994
  • Abstract
    Reasoning machine reading comprehension (R-MRC) aims to answer complex questions that require discrete reasoning based on text. To support discrete reasoning, evidence, typically the concise textual fragments that describe question-related facts, including topic entities and attribute values, are crucial clues from question to answer. However, previous end-to-end methods that achieve state-of-the-art performance rarely solve the problem by paying enough emphasis on the modeling of evidence, missing the opportunity to further improve the model's reasoning ability for R-MRC. To alleviate the above issue, in this paper, we propose an evidence-emphasized discrete reasoning approach (EviDR), in which sentence and clause level evidence is first detected based on distant supervision, and then used to drive a reasoning module implemented with a relational heterogeneous graph convolutional network to derive answers. Extensive experiments are conducted on DROP (discrete reasoning over paragraphs) dataset, and the results demonstrate the effectiveness of our proposed approach. In addition, qualitative analysis verifies the capability of the proposed evidence-emphasized discrete reasoning for R-MRC.

GGP: A Graph-based Grouping Planner for Explicit Control of Long Text Generation

  • Authors: Xuming Lin, Shaobo Cui, Zhongzhou Zhao, Wei Zhou, Ji Zhang, Haiqing Chen
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.07998
  • Pdf link: https://arxiv.org/pdf/2108.07998
  • Abstract
    Existing data-driven methods can handle short text generation well. However, when applied to long-text generation scenarios such as story generation or advertising text generation in the commercial scenario, these methods may generate illogical and uncontrollable texts. To address these issues, we propose a graph-based grouping planner (GGP) following the idea of first-plan-then-generate. Specifically, given a collection of key phrases, GGP first encodes these phrases into an instance-level sequential representation and a corpus-level graph-based representation separately. With these two synergic representations, we then regroup these phrases into a fine-grained plan, based on which we generate the final long text. We conduct our experiments on three long text generation datasets and the experimental results reveal that GGP significantly outperforms baselines, which shows that GGP can control long text generation by planning what to say and in what order.

Deep Hybrid Self-Prior for Full 3D Mesh Generation

  • Authors: Xingkui Wei, Zhengqing Chen, Yanwei Fu, Zhaopeng Cui, Yinda Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2108.08017
  • Pdf link: https://arxiv.org/pdf/2108.08017
  • Abstract
    We present a deep learning pipeline that leverages network self-prior to recover a full 3D model consisting of both a triangular mesh and a texture map from the colored 3D point cloud. Different from previous methods either exploiting 2D self-prior for image editing or 3D self-prior for pure surface reconstruction, we propose to exploit a novel hybrid 2D-3D self-prior in deep neural networks to significantly improve the geometry quality and produce a high-resolution texture map, which is typically missing from the output of commodity-level 3D scanners. In particular, we first generate an initial mesh using a 3D convolutional neural network with 3D self-prior, and then encode both 3D information and color information in the 2D UV atlas, which is further refined by 2D convolutional neural networks with the self-prior. In this way, both 2D and 3D self-priors are utilized for the mesh and texture recovery. Experiments show that, without the need of any additional training data, our method recovers the 3D textured mesh model of high quality from sparse input, and outperforms the state-of-the-art methods in terms of both the geometry and texture quality.

Worst-Case Efficient Dynamic Geometric Independent Set

  • Authors: Jean Cardinal, John Iacono
  • Subjects: Computational Geometry (cs.CG); Data Structures and Algorithms (cs.DS)
  • Arxiv link: https://arxiv.org/abs/2108.08050
  • Pdf link: https://arxiv.org/pdf/2108.08050
  • Abstract
    We consider the problem of maintaining an approximate maximum independent set of geometric objects under insertions and deletions. We present data structures that maintain a constant-factor approximate maximum independent set for broad classes of fat objects in $d$ dimensions, where $d$ is assumed to be a constant, in sublinear \textit{worst-case} update time. This gives the first results for dynamic independent set in a wide variety of geometric settings, such as disks, fat polygons, and their high-dimensional equivalents. For axis-aligned squares and hypercubes, our result improves upon all (recently announced) previous works. We obtain, in particular, a dynamic $(4+\epsilon)$-approximation for squares, with $O(\log^4 n)$ worst-case update time. Our result is obtained via a two-level approach. First, we develop a dynamic data structure which stores all objects and provides an approximate independent set when queried, with output-sensitive running time. We show that via standard methods such a structure can be used to obtain a dynamic algorithm with \textit{amortized} update time bounds. Then, to obtain worst-case update time algorithms, we develop a generic deamortization scheme that with each insertion/deletion keeps (i) the update time bounded and (ii) the number of changes in the independent set constant. We show that such a scheme is applicable to fat objects by showing an appropriate generalization of a separator theorem. Interestingly, we show that our deamortization scheme is also necessary in order to obtain worst-case update bounds: If for a class of objects our scheme is not applicable, then no constant-factor approximation with sublinear worst-case update time is possible. We show that such a lower bound applies even for seemingly simple classes of geometric objects including axis-aligned rectangles in the plane.

More Than React: Investigating The Role of Emoji Reaction in GitHub Pull Requests

  • Authors: Teyon Son, Tao Xiao, Dong Wang, Raula Gaikovina Kula, Takashi Ishio, Kenichi Matsumoto
  • Subjects: Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2108.08094
  • Pdf link: https://arxiv.org/pdf/2108.08094
  • Abstract
    Context: Open source software development has become more social and collaborative, especially with the rise of social coding platforms like GitHub. Since 2016, GitHub started to support more informal methods such as emoji reactions, with the goal to reduce commenting noise when reviewing any code changes to a repository. Interestingly, preliminary results indicate that emojis do not always reduce commenting noise (i.e., eight out of 20 emoji reactions), providing evidence that developers use emojis with ulterior intentions. From a reviewing context, the extent to which emoji reactions facilitate for a more efficient review process is unknown. Objective: In this registered report, we introduce the study protocols to investigate ulterior intentions and usages of emoji reactions, apart from reducing commenting noise during the discussions in GitHub pull requests (PRs). As part of the report, we first perform a preliminary analysis to whether emoji reactions can reduce commenting noise in PRs and then introduce the execution plan for the study. Method: We will use a mixed-methods approach in this study, i.e., quantitative and qualitative, with three hypotheses to test.

AdapterHub Playground: Simple and Flexible Few-Shot Learning with Adapters

  • Authors: Tilman Beck, Bela Bohlender, Christina Viehmann, Vincent Hane, Yanik Adamson, Jaber Khuri, Jonas Brossmann, Jonas Pfeiffer, Iryna Gurevych
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.08103
  • Pdf link: https://arxiv.org/pdf/2108.08103
  • Abstract
    The open-access dissemination of pretrained language models through online repositories has led to a democratization of state-of-the-art natural language processing (NLP) research. This also allows people outside of NLP to use such models and adapt them to specific use-cases. However, a certain amount of technical proficiency is still required which is an entry barrier for users who want to apply these models to a certain task but lack the necessary knowledge or resources. In this work, we aim to overcome this gap by providing a tool which allows researchers to leverage pretrained models without writing a single line of code. Built upon the parameter-efficient adapter modules for transfer learning, our AdapterHub Playground provides an intuitive interface, allowing the usage of adapters for prediction, training and analysis of textual data for a variety of NLP tasks. We present the tool's architecture and demonstrate its advantages with prototypical use-cases, where we show that predictive performance can easily be increased in a few-shot learning scenario. Finally, we evaluate its usability in a user study. We provide the code and a live interface at https://adapter-hub.github.io/playground.

Image Collation: Matching illustrations in manuscripts

  • Authors: Ryad Kaoua, Xi Shen, Alexandra Durr, Stavros Lazaris, David Picard, Mathieu Aubry
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2108.08109
  • Pdf link: https://arxiv.org/pdf/2108.08109
  • Abstract
    Illustrations are an essential transmission instrument. For a historian, the first step in studying their evolution in a corpus of similar manuscripts is to identify which ones correspond to each other. This image collation task is daunting for manuscripts separated by many lost copies, spreading over centuries, which might have been completely re-organized and greatly modified to adapt to novel knowledge or belief and include hundreds of illustrations. Our contributions in this paper are threefold. First, we introduce the task of illustration collation and a large annotated public dataset to evaluate solutions, including 6 manuscripts of 2 different texts with more than 2,000 illustrations and 1,200 annotated correspondences. Second, we analyze state-of-the-art similarity measures for this task and show that they succeed in simple cases but struggle for large manuscripts when the illustrations have undergone very significant changes and are discriminated only by fine details. Finally, we show clear evidence that significant performance boosts can be expected by exploiting cycle-consistent correspondences. Our code and data are available online.
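
The cycle-consistency idea mentioned above can be sketched with a toy filter: a correspondence is kept only if chaining best matches around a cycle of manuscripts returns to the starting illustration. The dictionaries below are invented stand-ins for the output of any pairwise similarity measure.

```python
from typing import Dict, List

# Toy sketch of cycle-consistency filtering: a correspondence A->B survives only
# if the chained best matches A->B->C->A return to the starting illustration.
def cycle_consistent(ab: Dict[int, int], bc: Dict[int, int], ca: Dict[int, int]) -> List[int]:
    kept = []
    for a, b in ab.items():
        c = bc.get(b)
        if c is not None and ca.get(c) == a:
            kept.append(a)
    return kept

# Invented match tables between three hypothetical manuscripts A, B, C.
ab = {0: 3, 1: 5, 2: 7}
bc = {3: 10, 5: 11, 7: 12}
ca = {10: 0, 11: 4, 12: 2}
print(cycle_consistent(ab, bc, ca))  # [0, 2] survive the cycle check
```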

Table Caption Generation in Scholarly Documents Leveraging Pre-trained Language Models

  • Authors: Junjie H. Xu, Kohei Shinden, Makoto P. Kato
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.08111
  • Pdf link: https://arxiv.org/pdf/2108.08111
  • Abstract
    This paper addresses the problem of generating table captions for scholarly documents, which often require additional information outside the table. To this end, we propose a method of retrieving relevant sentences from the paper body, and feeding the table content as well as the retrieved sentences into pre-trained language models (e.g. T5 and GPT-2) for generating table captions. The contributions of this paper are: (1) discussion on the challenges in table captioning for scholarly documents; (2) development of a dataset DocBank-TB, which is publicly available; and (3) comparison of caption generation methods for scholarly documents with different strategies to retrieve relevant sentences from the paper body. Our experimental results showed that T5 is the better generation model for this task, as it outperformed GPT-2 in BLEU and METEOR, implying that the generated text is clearer and more precise. Moreover, inputting relevant sentences matching the row header or whole table is effective.
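
A generic sketch of the input/output plumbing for T5-based caption generation is shown below, using the Hugging Face `transformers` API. The prompt template and the `t5-base` checkpoint are assumptions; without the paper's fine-tuning on DocBank-TB this will not produce a meaningful caption, it only illustrates how a linearized table plus retrieved sentences could be fed to the model.

```python
# Assumed setup: a plain pre-trained T5 checkpoint and an invented prompt template.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

table = "row: Model | BLEU ; row: T5 | 10.1 ; row: GPT-2 | 8.7"      # linearized table
context = "We compare caption generation models on DocBank-TB."       # retrieved sentence
prompt = f"generate table caption: {table} context: {context}"

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_length=40, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```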

Fighting Game Commentator with Pitch and Loudness Adjustment Utilizing Highlight Cues

  • Authors: Junjie H. Xu, Zhou Fang, Qihang Chen, Satoru Ohno, Pujana Paliyawan
  • Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2108.08112
  • Pdf link: https://arxiv.org/pdf/2108.08112
  • Abstract
    This paper presents a commentator for providing real-time game commentary in a fighting game. The commentary takes into account highlight cues, obtained by analyzing scenes during gameplay, as input to adjust the pitch and loudness of commentary to be spoken by using a Text-to-Speech (TTS) technology. We investigate different designs for pitch and loudness adjustment. The proposed AI consists of two parts: a dynamic adjuster for controlling pitch and loudness of the TTS and a real-time game commentary generator. We conduct a pilot study on a fighting game, and our result shows that by adjusting the loudness significantly according to the level of game highlight, the entertainment of the gameplay can be enhanced.
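
A possible way to realize the pitch and loudness adjustment described above is a simple mapping from a highlight score to TTS prosody settings; the ranges below are invented for illustration and are not the values studied in the paper.

```python
# Hypothetical mapping from a highlight score in [0, 1] to TTS prosody settings.
def prosody_for_highlight(highlight: float) -> dict:
    highlight = max(0.0, min(1.0, highlight))
    return {
        # Raise pitch by up to 4 semitones and loudness by up to 6 dB at peak moments.
        "pitch_semitones": 4.0 * highlight,
        "gain_db": 6.0 * highlight,
        "rate": 1.0 + 0.2 * highlight,  # speak slightly faster during highlights
    }

for h in (0.0, 0.5, 1.0):
    print(h, prosody_for_highlight(h))
```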

Target Adaptive Context Aggregation for Video Scene Graph Generation

  • Authors: Yao Teng, Limin Wang, Zhifeng Li, Gangshan Wu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2108.08121
  • Pdf link: https://arxiv.org/pdf/2108.08121
  • Abstract
    This paper deals with a challenging task of video scene graph generation (VidSGG), which could serve as a structured video representation for high-level understanding tasks. We present a new {\em detect-to-track} paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking. Specifically, we design an efficient method for frame-level VidSGG, termed as {\em Target Adaptive Context Aggregation Network} (TRACE), with a focus on capturing spatio-temporal context information for relation recognition. Our TRACE framework streamlines the VidSGG pipeline with a modular design, and presents two unique blocks of Hierarchical Relation Tree (HRTree) construction and Target-adaptive Context Aggregation. More specifically, our HRTree first provides an adaptive structure for organizing possible relation candidates efficiently, and guides the context aggregation module to effectively capture spatio-temporal structure information. Then, we obtain a contextualized feature representation for each relation candidate and build a classification head to recognize its relation category. Finally, we provide a simple temporal association strategy to track TRACE detected results to yield the video-level VidSGG. We perform experiments on two VidSGG benchmarks: ImageNet-VidVRD and Action Genome, and the results demonstrate that our TRACE achieves the state-of-the-art performance. The code and models are made available at \url{https://github.com/MCG-NJU/TRACE}.

Multi-Variant Execution at the Edge

  • Authors: Javier Cabrera Arteaga, Pierre Laperdrix, Martin Monperrus, Benoit Baudry
  • Subjects: Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2108.08125
  • Pdf link: https://arxiv.org/pdf/2108.08125
  • Abstract
    Edge-cloud computing offloads parts of the computations that traditionally occur in the cloud to edge nodes, e.g., CDN servers, in order to get closer to the users and reduce latency. To improve performance even further, WebAssembly is increasingly used in this context. Edge-cloud computing providers, such as Fastly or Cloudflare, let their clients deploy stateless services in the form of WebAssembly binaries, which are then translated to machine code and sandboxed for a safe execution at the edge. In this context, we propose a technique that (i) automatically diversifies WebAssembly binaries that are deployed to the edge and (ii) randomizes execution paths at runtime, turning the execution of the services into a moving target. Given a service to be deployed at the edge, we automatically synthesize functionally equivalent variants for the functions that implement the service. All the variants are then wrapped into a single multivariant WebAssembly binary. When the service endpoint is executed, every time a function is invoked, one of its variants is randomly selected. We implement this technique in the MEWE tool and we validate it with 7 services for cryptography and QR encoding. MEWE generates multivariant binaries that embed hundreds of function variants. We execute the multivariant binaries on the worldwide edge platform provided by Fastly. We show that, at runtime, the multivariant binaries exhibit a remarkable diversity of execution traces across the whole edge platform.

RTE: A Tool for Annotating Relation Triplets from Text

  • Authors: Ankan Mullick, Animesh Bera, Tapas Nayak
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2108.08184
  • Pdf link: https://arxiv.org/pdf/2108.08184
  • Abstract
    In this work, we present a Web-based annotation tool `Relation Triplets Extractor' \footnote{https://abera87.github.io/annotate/} (RTE) for annotating relation triplets from text. Relation extraction is an important task for extracting structured information about real-world entities from the unstructured text available on the Web. In relation extraction, we focus on binary relations, i.e., relations between two entities. Recently, many supervised models have been proposed to solve this task, but they mostly use noisy training data obtained using the distant supervision method. In many cases, evaluation of the models is also done based on a noisy test dataset. The lack of a clean annotated dataset is a key challenge in this area of research. In this work, we built a web-based tool where researchers can easily annotate datasets for relation extraction on their own. We use a server-less architecture for this tool, and the entire annotation operation is processed using client-side code. Thus it does not suffer from any network latency, and the privacy of the user's data is also maintained. We hope that this tool will be beneficial for the researchers to advance the field of relation extraction.

Transformers predicting the future. Applying attention in next-frame and time series forecasting

  • Authors: Radostin Cholakov, Todor Kolev
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2108.08224
  • Pdf link: https://arxiv.org/pdf/2108.08224
  • Abstract
    Recurrent Neural Networks were, until recently, one of the best ways to capture the temporal dependencies in sequences. However, with the introduction of the Transformer, it has been proven that an architecture with only attention mechanisms and without any RNN can improve on the results in various sequence processing tasks (e.g. NLP). Multiple studies since then have shown that similar approaches can be applied to images, point clouds, video, audio or time series forecasting. Furthermore, solutions such as the Perceiver or the Informer have been introduced to expand on the applicability of the Transformer. Our main objective is testing and evaluating the effectiveness of applying Transformer-like models to time series data, tackling susceptibility to anomalies, context awareness and space complexity by fine-tuning the hyperparameters, preprocessing the data, applying dimensionality reduction or convolutional encodings, etc. We are also looking at the problem of next-frame prediction and exploring ways to modify existing solutions in order to achieve higher performance and learn generalized knowledge.

TSI: an Ad Text Strength Indicator using Text-to-CTR and Semantic-Ad-Similarity

  • Authors: Shaunak Mishra, Changwei Hu, Manisha Verma, Kevin Yen, Yifan Hu, Maxim Sviridenko
  • Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
  • Arxiv link: https://arxiv.org/abs/2108.08226
  • Pdf link: https://arxiv.org/pdf/2108.08226
  • Abstract
    Coming up with effective ad text is a time consuming process, and particularly challenging for small businesses with limited advertising experience. When an inexperienced advertiser onboards with a poorly written ad text, the ad platform has the opportunity to detect low performing ad text, and provide improvement suggestions. To realize this opportunity, we propose an ad text strength indicator (TSI) which: (i) predicts the click-through-rate (CTR) for an input ad text, (ii) fetches similar existing ads to create a neighborhood around the input ad, (iii) and compares the predicted CTRs in the neighborhood to declare whether the input ad is strong or weak. In addition, as suggestions for ad text improvement, TSI shows anonymized versions of superior ads (higher predicted CTR) in the neighborhood. For (i), we propose a BERT based text-to-CTR model trained on impressions and clicks associated with an ad text. For (ii), we propose a sentence-BERT based semantic-ad-similarity model trained using weak labels from ad campaign setup data. Offline experiments demonstrate that our BERT based text-to-CTR model achieves a significant lift in CTR prediction AUC for cold start (new) advertisers compared to bag-of-words based baselines. In addition, our semantic-textual-similarity model for similar ads retrieval achieves a precision@1 of 0.93 (for retrieving ads from the same product category); this is significantly higher compared to unsupervised TF-IDF, word2vec, and sentence-BERT baselines. Finally, we share promising online results from advertisers in the Yahoo (Verizon Media) ad platform where a variant of TSI was implemented with sub-second end-to-end latency.
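
Schematically, the TSI decision rule combines a CTR predictor with a semantic neighborhood lookup. The sketch below assumes `embed` and `predict_ctr` stand in for the paper's sentence-BERT and text-to-CTR models; the threshold and neighborhood size are made-up parameters.

```python
import numpy as np

# A schematic version of the TSI decision rule: embed the input ad, find its nearest
# existing ads, and call it "weak" if most neighbours have a higher predicted CTR.
# `embed` and `predict_ctr` are placeholders, not the paper's BERT-based models.
def ad_strength(input_text, corpus_texts, embed, predict_ctr, k=10, margin=0.6):
    q = embed(input_text)
    vecs = np.stack([embed(t) for t in corpus_texts])
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    neighbours = np.argsort(-sims)[:k]                      # most similar existing ads
    input_ctr = predict_ctr(input_text)
    better = [corpus_texts[i] for i in neighbours
              if predict_ctr(corpus_texts[i]) > input_ctr]
    is_weak = len(better) / max(len(neighbours), 1) > margin
    return ("weak" if is_weak else "strong"), better        # better ads as suggestions
```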

Analogical Learning in Tactical Decision Games

  • Authors: Tom Hinrichs, Greg Dunham, Ken Forbus
  • Subjects: Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2108.08227
  • Pdf link: https://arxiv.org/pdf/2108.08227
  • Abstract
    Tactical Decision Games (TDGs) are military conflict scenarios presented both textually and graphically on a map. These scenarios provide a challenging domain for machine learning because they are open-ended, highly structured, and typically contain many details of varying relevance. We have developed a problem-solving component of an interactive companion system that proposes military tasks to solve TDG scenarios using a combination of analogical retrieval, mapping, and constraint propagation. We use this problem-solving component to explore analogical learning. In this paper, we describe the problems encountered in learning for this domain, and the methods we have developed to address these, such as partition constraints on analogical mapping correspondences and the use of incremental remapping to improve robustness. We present the results of learning experiments that show improvement in performance through the simple accumulation of examples, despite a weak domain theory.

Streaming and Learning the Personal Context

  • Authors: Fausto Giunchiglia, Marcelo Rodas Britez, Andrea Bontempelli, Xiaoyue Li
  • Subjects: Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2108.08234
  • Pdf link: https://arxiv.org/pdf/2108.08234
  • Abstract
    The representation of the personal context is complex and essential to improve the help machines can give to humans for making sense of the world, and the help humans can give to machines to improve their efficiency. We aim to design a novel model representation of the personal context and design a learning process for better integration with machine learning. We aim to implement these elements in a modern system architecture focused on real-life environments. We also show how our proposal improves on closely related work. Finally, we are moving toward a better personal context representation with an improved model, the implementation of the learning process, and the architectural design of these components.

LOKI: Long Term and Key Intentions for Trajectory Prediction

  • Authors: Harshayu Girase, Haiming Gang, Srikanth Malla, Jiachen Li, Akira Kanehara, Karttikeya Mangalam, Chiho Choi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2108.08236
  • Pdf link: https://arxiv.org/pdf/2108.08236
  • Abstract
    Recent advances in trajectory prediction have shown that explicit reasoning about agents' intent is important to accurately forecast their motion. However, the current research activities are not directly applicable to intelligent and safety-critical systems. This is mainly because very few public datasets are available, and they only consider pedestrian-specific intents for a short temporal horizon from a restricted egocentric view. To this end, we propose LOKI (LOng term and Key Intentions), a novel large-scale dataset that is designed to tackle joint trajectory and intention prediction for heterogeneous traffic agents (pedestrians and vehicles) in an autonomous driving setting. The LOKI dataset is created to discover several factors that may affect intention, including i) agent's own will, ii) social interactions, iii) environmental constraints, and iv) contextual information. We also propose a model that jointly performs trajectory and intention prediction, showing that recurrently reasoning about intention can assist with trajectory prediction. We show our method outperforms state-of-the-art trajectory prediction methods by up to 27% and also provide a baseline for frame-wise intention estimation.

Aurora: a probabilistic algorithm for distributed ledgers enabling trustless synchronization and transaction inclusion verification

  • Authors: Federico Matteo Benčić, Ivana Podnar Žarko
  • Subjects: Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2108.08272
  • Pdf link: https://arxiv.org/pdf/2108.08272
  • Abstract
    A new node joining a blockchain network first synchronizes with the network to verify ledger state by downloading the entire ledger history. We present Aurora, a probabilistic algorithm that identifies honest nodes for transient or persistent communication in the presence of malicious nodes in a blockchain network, or ceases operation if it is unable to do so. The algorithm allows a node joining the network to make an informed decision about its next synchronization step or to verify that a transaction is contained in a valid ledger block without downloading the entire ledger or even the header chain. The algorithm constructs a Directed Acyclic Graph on the network topology to select a subset of nodes including a predefined number of honest nodes with a given probability. It is evaluated on a Bitcoin-like network topology using an open-source blockchain simulator. We investigate algorithm performance and analyze its communication complexity. Our results show that the algorithm facilitates trustless interactions of resource-constrained nodes with a blockchain network containing malicious nodes to enable a leaner initial blockchain download or an efficient and trustless transaction inclusion verification. Moreover, the algorithm can be implemented without any changes to the existing consensus protocol.
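
As a rough intuition for the kind of probabilistic guarantee such a node-selection scheme targets (this is not the paper's DAG-based construction), one can ask how many peers must be sampled so that, given a known honest fraction, at least k honest nodes are included with a target probability. A back-of-the-envelope binomial sketch, assuming each sampled peer is independently honest with probability equal to the honest fraction:

```python
# Simplified model: sample n peers, each honest independently with probability
# honest_fraction, and ask how likely it is that at least k of them are honest.
from math import comb

def p_at_least_k_honest(n, k, honest_fraction):
    """P(at least k honest nodes among n independently sampled peers)."""
    return sum(comb(n, i) * honest_fraction**i * (1 - honest_fraction)**(n - i)
               for i in range(k, n + 1))

def min_sample_size(k, honest_fraction, target=0.999, n_max=1000):
    """Smallest n such that the probability above reaches the target."""
    for n in range(k, n_max + 1):
        if p_at_least_k_honest(n, k, honest_fraction) >= target:
            return n
    return None

# Example: 30% of the network is malicious, we want >= 5 honest peers w.p. 0.999.
print(min_sample_size(k=5, honest_fraction=0.7, target=0.999))
```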

User configurable 3D object regeneration for spatial privacy

  • Authors: Arpit Nama, Amaya Dharmasiri, Kanchana Thilakarathna, Albert Zomaya, Jaybie Agullo de Guzman
  • Subjects: Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2108.08273
  • Pdf link: https://arxiv.org/pdf/2108.08273
  • Abstract
    The environmental understanding capabilities of augmented (AR) and mixed reality (MR) devices are continuously improving through advances in sensing, computer vision, and machine learning. Various AR/MR applications demonstrate such capabilities, i.e. scanning a space using a handheld or head-mounted device and capturing a digital representation of the space that is an accurate copy of the real space. However, these capabilities pose privacy risks to users: personally identifiable information can leak from captured 3D maps of sensitive spaces and/or captured sensitive objects within the mapped space. Thus, in this work, we demonstrate how 3D object regeneration can be leveraged to preserve the privacy of 3D point clouds. That is, we employ an intermediary layer of protection to transform the 3D point cloud before providing it to third-party applications. Specifically, we use an existing adversarial autoencoder to generate copies of 3D objects where the likeness of the copies to the original can be varied. To test the viability and performance of this method as a privacy-preserving mechanism, we use a 3D classifier to classify and identify these transformed point clouds, i.e. perform super-class and intra-class classification. To measure the performance of the proposed privacy framework, we define a privacy metric, $\Pi\in[0,1]$, and a utility metric, $Q\in[0,1]$, both of which are to be maximized. Experimental evaluation shows that the framework can indeed variably affect the privacy of a 3D object by varying the privilege level $l\in[0,1]$: if a low $l<0.17$ is maintained, $\Pi_1,\Pi_2>0.4$ is ensured, where $\Pi_1,\Pi_2$ are super- and intra-class privacy. Lastly, the framework can ensure relatively high intra-class privacy and utility, i.e. $\Pi_2>0.63$ and $Q>0.70$, if the privilege level is kept within the range $0.17<l<0.25$.

New submissions for Thu, 2 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

Let Your Camera See for You: A Novel Two-Factor Authentication Method against Real-Time Phishing Attacks

  • Authors: Yuanyi Sun, Sencun Zhu, Yao Zhao, Pengfei Sun
  • Subjects: Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2109.00132
  • Pdf link: https://arxiv.org/pdf/2109.00132
  • Abstract
    Today, two-factor authentication (2FA) is a widely implemented mechanism to counter phishing attacks. Although much effort has been invested in 2FA, most 2FA systems are still vulnerable to carefully designed phishing attacks, and some even require special hardware, which limits their wide deployment. Recently, real-time phishing (RTP) has made the situation even worse because an adversary can effortlessly establish a phishing website replicating a target website without any background in web page design. Traditional 2FA can be easily bypassed by such RTP attacks. In this work, we propose a novel 2FA system to counter RTP attacks. The main idea is to request a user to take a photo of the web browser with the domain name in the address bar as the 2nd authentication factor. The web server side extracts the domain name information based on Optical Character Recognition (OCR), and then determines if the user is visiting this website or a fake one, thus defeating RTP attacks, in which an adversary must set up a fake website with a different domain. We prototyped our system and evaluated its performance in various environments. The results showed that PhotoAuth is an effective technique with good scalability. We also showed that compared to other 2FA systems, PhotoAuth has several advantages, especially no special hardware or software support is needed on the client side except a phone, making it readily deployable.
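
A minimal sketch of the server-side verification step this abstract describes, assuming pytesseract as the OCR engine (the abstract does not prescribe one) and a simple regular expression to pull a domain-looking token out of the recognized text; the photo path and expected domain are placeholders.

```python
# OCR the user's photo of the browser, extract something that looks like a
# domain from the recognized text, and compare it with the legitimate domain.
# pytesseract is used here only as a stand-in OCR engine.
import re
from PIL import Image
import pytesseract

DOMAIN_RE = re.compile(r"\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b", re.IGNORECASE)

def verify_photo(photo_path: str, expected_domain: str) -> bool:
    text = pytesseract.image_to_string(Image.open(photo_path))
    domains = {m.lower() for m in DOMAIN_RE.findall(text)}
    return expected_domain.lower() in domains

# Usage (path and domain are illustrative placeholders):
# print(verify_photo("login_photo.jpg", "bank.example.com"))
```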

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Thu, 9 Sep 21

Keyword: Text Detection

Mask is All You Need: Rethinking Mask R-CNN for Dense and Arbitrary-Shaped Scene Text Detection

  • Authors: Xugong Qin, Yu Zhou, Youhui Guo, Dayan Wu, Zhihong Tian, Ning Jiang, Hongbin Wang, Weiping Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.03426
  • Pdf link: https://arxiv.org/pdf/2109.03426
  • Abstract
    Due to its large success in object detection and instance segmentation, Mask R-CNN attracts great attention and is widely adopted as a strong baseline for arbitrary-shaped scene text detection and spotting. However, two issues remain to be settled. The first is the dense text case, which is easily neglected but of practical importance. Multiple instances may exist in one proposal, which makes it difficult for the mask head to distinguish different instances and degrades performance. In this work, we argue that the performance degradation results from a learning confusion issue in the mask head. We propose to use an MLP decoder instead of the "deconv-conv" decoder in the mask head, which alleviates the issue and promotes robustness significantly. We also propose instance-aware mask learning, in which the mask head learns to predict the shape of the whole instance rather than classify each pixel as text or non-text. With instance-aware mask learning, the mask branch can learn separated and compact masks. The second issue is that, due to large variations in scale and aspect ratio, RPN needs complicated anchor settings, making it hard to maintain and transfer across different datasets. To settle this issue, we propose an adaptive label assignment in which all instances, especially those with extreme aspect ratios, are guaranteed to be associated with enough anchors. Equipped with these components, the proposed method, named MAYOR, achieves state-of-the-art performance on five benchmarks including DAST1500, MSRA-TD500, ICDAR2015, CTW1500, and Total-Text.
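
To illustrate the architectural change the abstract describes, here is a toy PyTorch contrast between a "deconv-conv" mask decoder and an MLP decoder that predicts the whole instance mask from pooled RoI features. Layer sizes and shapes are illustrative only and are not taken from the paper.

```python
# Toy contrast between a "deconv-conv" mask decoder and an MLP decoder that
# predicts the whole instance mask in one shot from RoI features.
import torch
import torch.nn as nn

class DeconvConvMaskHead(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(in_ch, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, 1, 1))                  # per-pixel text/non-text logit

    def forward(self, roi_feats):                  # (N, C, 14, 14) -> (N, 1, 28, 28)
        return self.decoder(roi_feats)

class MLPMaskHead(nn.Module):
    def __init__(self, in_ch=256, roi_size=14, mask_size=28):
        super().__init__()
        self.mask_size = mask_size
        self.decoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_ch * roi_size * roi_size, 1024), nn.ReLU(),
            nn.Linear(1024, mask_size * mask_size))  # whole-instance mask at once

    def forward(self, roi_feats):                   # (N, C, 14, 14) -> (N, 1, 28, 28)
        out = self.decoder(roi_feats)
        return out.view(-1, 1, self.mask_size, self.mask_size)

feats = torch.randn(4, 256, 14, 14)
print(DeconvConvMaskHead()(feats).shape, MLPMaskHead()(feats).shape)
```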

Which and Where to Focus: A Simple yet Accurate Framework for Arbitrary-Shaped Nearby Text Detection in Scene Images

  • Authors: Youhui Guo, Yu Zhou, Xugong Qin, Weiping Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.03451
  • Pdf link: https://arxiv.org/pdf/2109.03451
  • Abstract
    Scene text detection has drawn close attention from researchers. Though many methods have been proposed for horizontal and oriented texts, previous methods may not perform well when dealing with arbitrary-shaped texts such as curved texts. In particular, a confusion problem arises in the case of nearby text instances. In this paper, we propose a simple yet effective method for accurate detection of arbitrary-shaped, nearby scene text. First, a One-to-Many Training Scheme (OMTS) is designed to eliminate confusion and enable the proposals to learn more appropriate ground truths in the case of nearby text instances. Second, we propose a Proposal Feature Attention Module (PFAM) to exploit more effective features for each proposal, which can better adapt to arbitrary-shaped text instances. Finally, we propose a baseline that is based on Faster R-CNN and outputs the curve representation directly. Equipped with PFAM and OMTS, the detector achieves state-of-the-art or competitive performance on several challenging benchmarks.

Keyword: Text Recognition

There is no result

Keyword: OCR

There is no result

Keyword: Handwriting

There is no result

Keyword: Scene Text

Mask is All You Need: Rethinking Mask R-CNN for Dense and Arbitrary-Shaped Scene Text Detection

  • Authors: Xugong Qin, Yu Zhou, Youhui Guo, Dayan Wu, Zhihong Tian, Ning Jiang, Hongbin Wang, Weiping Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.03426
  • Pdf link: https://arxiv.org/pdf/2109.03426
  • Abstract
    Due to its large success in object detection and instance segmentation, Mask R-CNN attracts great attention and is widely adopted as a strong baseline for arbitrary-shaped scene text detection and spotting. However, two issues remain to be settled. The first is the dense text case, which is easily neglected but of practical importance. Multiple instances may exist in one proposal, which makes it difficult for the mask head to distinguish different instances and degrades performance. In this work, we argue that the performance degradation results from a learning confusion issue in the mask head. We propose to use an MLP decoder instead of the "deconv-conv" decoder in the mask head, which alleviates the issue and promotes robustness significantly. We also propose instance-aware mask learning, in which the mask head learns to predict the shape of the whole instance rather than classify each pixel as text or non-text. With instance-aware mask learning, the mask branch can learn separated and compact masks. The second issue is that, due to large variations in scale and aspect ratio, RPN needs complicated anchor settings, making it hard to maintain and transfer across different datasets. To settle this issue, we propose an adaptive label assignment in which all instances, especially those with extreme aspect ratios, are guaranteed to be associated with enough anchors. Equipped with these components, the proposed method, named MAYOR, achieves state-of-the-art performance on five benchmarks including DAST1500, MSRA-TD500, ICDAR2015, CTW1500, and Total-Text.

Which and Where to Focus: A Simple yet Accurate Framework for Arbitrary-Shaped Nearby Text Detection in Scene Images

  • Authors: Youhui Guo, Yu Zhou, Xugong Qin, Weiping Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.03451
  • Pdf link: https://arxiv.org/pdf/2109.03451
  • Abstract
    Scene text detection has drawn close attention from researchers. Though many methods have been proposed for horizontal and oriented texts, previous methods may not perform well when dealing with arbitrary-shaped texts such as curved texts. In particular, a confusion problem arises in the case of nearby text instances. In this paper, we propose a simple yet effective method for accurate detection of arbitrary-shaped, nearby scene text. First, a One-to-Many Training Scheme (OMTS) is designed to eliminate confusion and enable the proposals to learn more appropriate ground truths in the case of nearby text instances. Second, we propose a Proposal Feature Attention Module (PFAM) to exploit more effective features for each proposal, which can better adapt to arbitrary-shaped text instances. Finally, we propose a baseline that is based on Faster R-CNN and outputs the curve representation directly. Equipped with PFAM and OMTS, the detector achieves state-of-the-art or competitive performance on several challenging benchmarks.

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Wed, 6 Oct 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

An Experimental Evaluation on Deepfake Detection using Deep Face Recognition

  • Authors: Sreeraj Ramachandran, Aakash Varma Nadimpalli, Ajita Rattani
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2110.01640
  • Pdf link: https://arxiv.org/pdf/2110.01640
  • Abstract
    Significant advances in deep learning have obtained hallmark accuracy rates for various computer vision applications. However, advances in deep generative models have also led to the generation of very realistic fake content, also known as deepfakes, causing a threat to privacy, democracy, and national security. Most current deepfake detection methods are framed as a binary classification problem, distinguishing authentic images or videos from fake ones using two-class convolutional neural networks (CNNs). These methods are based on detecting visual artifacts and temporal or color inconsistencies produced by deep generative models. However, these methods require a large amount of real and fake data for model training, and their performance drops significantly in cross-dataset evaluation with samples generated using advanced deepfake generation techniques. In this paper, we thoroughly evaluate the efficacy of deep face recognition in identifying deepfakes, using different loss functions and deepfake generation techniques. Experimental investigations on the challenging Celeb-DF and FaceForensics++ deepfake datasets suggest that deep face recognition identifies deepfakes more effectively than two-class CNNs and the ocular modality. Reported results suggest a maximum Area Under Curve (AUC) of 0.98 and an Equal Error Rate (EER) of 7.1% in detecting deepfakes using face recognition on the Celeb-DF dataset. This EER is lower by 16.6% compared to the EER obtained for the two-class CNN and the ocular modality on the Celeb-DF dataset. Further, on the FaceForensics++ dataset, an AUC of 0.99 and EER of 2.04% were obtained. The use of biometric facial recognition technology has the advantage of bypassing the need for a large amount of fake data for model training and obtaining better generalizability to evolving deepfake creation techniques.

Rerunning OCR -- A Machine Learning Approach to Quality Assessment and Enhancement Prediction

  • Authors: Pit Schneider
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2110.01661
  • Pdf link: https://arxiv.org/pdf/2110.01661
  • Abstract
    Iterating with new and improved OCR solutions requires decisions about which documents to target as reprocessing candidates. This especially applies when the underlying data collection is of considerable size and rather diverse in terms of fonts, languages, periods of publication and, consequently, OCR quality. This article captures the efforts of the National Library of Luxembourg to support those exact decisions, which are crucial to guarantee low computational overhead and reduced quality degradation risks, combined with a more quantifiable OCR improvement. In particular, this work explains the library's methodology for text-block-level quality assessment. As an extension of this technique, a second contribution comes in the form of a regression model that takes the enhancement potential of a new OCR engine into account. Both mark promising approaches, especially for cultural institutions dealing with historic data of lower quality.
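
A hedged sketch of the enhancement-prediction idea: fit a regression model on block-level quality features to estimate the gain from re-running a newer engine, and reprocess only blocks where the predicted gain looks worthwhile. The feature set, the linear model, and the threshold below are assumptions for illustration, not the library's actual pipeline.

```python
# Predict, per text block, the quality gain from re-running a newer OCR engine,
# so only promising blocks are selected for reprocessing. All data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Fake block-level features: [current quality score, dictionary-word ratio,
# mean word confidence, publication-year bucket]
X = rng.uniform(size=(500, 4))
# Fake target: observed quality improvement after reprocessing a sample.
y = 0.4 * (1 - X[:, 0]) + 0.1 * X[:, 1] + rng.normal(0, 0.02, size=500)

model = LinearRegression().fit(X, y)
predicted_gain = model.predict(X)
reprocess = predicted_gain > 0.05        # rerun OCR only where the gain looks worthwhile
print(f"blocks selected for reprocessing: {reprocess.sum()} / {len(X)}")
```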

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts

  • Authors: Yuta Koreeda, Christopher D. Manning
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2110.01799
  • Pdf link: https://arxiv.org/pdf/2110.01799
  • Abstract
    Reviewing contracts is a time-consuming procedure that incurs large expenses to companies and social inequality to those who cannot afford it. In this work, we propose "document-level natural language inference (NLI) for contracts", a novel, real-world application of NLI that addresses such problems. In this task, a system is given a set of hypotheses (such as "Some obligations of Agreement may survive termination.") and a contract, and it is asked to classify whether each hypothesis is "entailed by", "contradicting to" or "not mentioned by" (neutral to) the contract as well as identifying "evidence" for the decision as spans in the contract. We annotated and release the largest corpus to date consisting of 607 annotated contracts. We then show that existing models fail badly on our task and introduce a strong baseline, which (1) models evidence identification as multi-label classification over spans instead of trying to predict start and end tokens, and (2) employs more sophisticated context segmentation for dealing with long documents. We also show that linguistic characteristics of contracts, such as negations by exceptions, are contributing to the difficulty of this task and that there is much room for improvement.

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Mon, 13 Sep 21

Keyword: Text Detection

Artificial Text Detection via Examining the Topology of Attention Maps

  • Authors: Laida Kushnareva, Daniil Cherniavskii, Vladislav Mikhailov, Ekaterina Artemova, Serguei Barannikov, Alexander Bernstein, Irina Piontkovskaya, Dmitri Piontkovski, Evgeny Burnaev
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2109.04825
  • Pdf link: https://arxiv.org/pdf/2109.04825
  • Abstract
    The impressive capability of recent generative models to create texts that are hard to distinguish from human-written ones can be misused for generating fake news, product reviews, and even abusive content. Despite the prominent performance of existing methods for artificial text detection, they still lack interpretability and robustness towards unseen models. To this end, we propose three novel types of interpretable topological features for this task based on Topological Data Analysis (TDA), which is currently understudied in the field of NLP. We empirically show that the features derived from the BERT model outperform count- and neural-based baselines by up to 10% on three common datasets, and tend to be the most robust towards unseen GPT-style generation models as opposed to existing methods. The probing analysis of the features reveals their sensitivity to surface and syntactic properties. The results demonstrate that TDA is a promising direction for NLP tasks, specifically those that incorporate surface and structural information.
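
As a very rough proxy for features of this flavor (and explicitly not the authors' persistent-homology features), one can threshold an attention matrix into a graph and count its connected components at several thresholds, a quantity loosely related to 0-dimensional topology. The sizes and thresholds below are arbitrary.

```python
# Count connected components of the thresholded attention graph at several
# thresholds. This is only a crude, loosely related graph statistic.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def component_profile(attention, thresholds=(0.05, 0.1, 0.2)):
    """Number of connected components of the thresholded attention graph."""
    sym = np.maximum(attention, attention.T)       # make the graph undirected
    profile = []
    for t in thresholds:
        n, _ = connected_components(csr_matrix(sym >= t), directed=False)
        profile.append(n)
    return profile

rng = np.random.default_rng(4)
attn = rng.dirichlet(np.ones(32), size=32)         # fake 32x32 attention map
print(component_profile(attn))
```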

Keyword: Text Recognition

There is no result

Keyword: OCR

FR-Detect: A Multi-Modal Framework for Early Fake News Detection on Social Media Using Publishers Features

  • Authors: Ali Jarrahi, Leila Safari
  • Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2109.04835
  • Pdf link: https://arxiv.org/pdf/2109.04835
  • Abstract
    In recent years, with the expansion of the Internet and attractive social media infrastructures, people prefer to follow the news through these media. Despite the many advantages of these media in the news field, the lack of any control and verification mechanism has led to the spread of fake news, one of the most serious threats to democracy, the economy, journalism, and freedom of expression. Designing and using automatic methods to detect fake news on social media has become a significant challenge. In this paper, we examine the publishers' role in detecting fake news on social media. We also propose a highly accurate multi-modal framework, FR-Detect, which uses user-related and content-related features and has early detection capability. For this purpose, two new user-related features, namely Activity Credibility and Influence, have been introduced for publishers. Furthermore, a sentence-level convolutional neural network is used to properly combine these features with latent textual content features. Experimental results have shown that the publishers' features can improve the performance of content-based models by up to 13% and 29% in accuracy and F1-score, respectively.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Mon, 30 Aug 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

WAD: A Deep Reinforcement Learning Agent for Urban Autonomous Driving

  • Authors: Arjit Sharma, Sahil Sharma
  • Subjects: Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2108.12134
  • Pdf link: https://arxiv.org/pdf/2108.12134
  • Abstract
    Urban autonomous driving is an open and challenging problem to solve, as the decision-making system has to account for several dynamic factors like multi-agent interactions, diverse scene perceptions, complex road geometries, and other rarely occurring real-world events. On the other hand, with deep reinforcement learning (DRL) techniques, agents have learned many complex policies. They have even achieved superhuman performance in various Atari games and in DeepMind's AlphaGo. However, current DRL techniques do not generalize well to complex urban driving scenarios. This paper introduces the DRL-driven Watch and Drive (WAD) agent for end-to-end urban autonomous driving. Motivated by recent advancements, the study aims to detect important objects/states in the high-dimensional observation spaces of CARLA and extract the latent state from them. The latent state information is then passed to WAD agents based on TD3 and SAC to learn the optimal driving policy. Our novel approach, which utilizes fewer resources, step-by-step learning of different driving tasks, a hard episode termination policy, and a tailored reward mechanism, has led our agents to achieve a 100% success rate on all driving tasks in the original CARLA benchmark and set a new record of 82% on the more complex NoCrash benchmark, outperforming the state-of-the-art model by more than 30% on NoCrash.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Thu, 14 Oct 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

Optical Character Recognition of 19th Century Classical Commentaries: the Current State of Affairs

  • Authors: Matteo Romanello, Sven Najem-Meyer, Bruce Robertson
  • Subjects: Digital Libraries (cs.DL); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.06817
  • Pdf link: https://arxiv.org/pdf/2110.06817
  • Abstract
    Together with critical editions and translations, commentaries are one of the main genres of publication in literary and textual scholarship, and have a century-long tradition. Yet, the exploitation of thousands of digitized historical commentaries was hitherto hindered by the poor quality of Optical Character Recognition (OCR), especially on commentaries to Greek texts. In this paper, we evaluate the performance of two pipelines suitable for the OCR of historical classical commentaries. Our results show that Kraken + Ciaconna reaches a substantially lower character error rate (CER) than Tesseract/OCR-D on commentary sections with a high density of polytonic Greek text (average CER 7% vs. 13%), while Tesseract/OCR-D is slightly more accurate than Kraken + Ciaconna on text sections written predominantly in Latin script (average CER 8.2% vs. 8.4%). As part of this paper, we also release GT4HistComment, a small dataset with OCR ground truth for 19th-century classical commentaries, and Pogretra, a large collection of training data and pre-trained models for a wide variety of ancient Greek typefaces.
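
For reference, the character error rate (CER) used in this comparison is the standard edit-distance-over-reference-length measure; a small self-contained implementation (independent of either OCR pipeline) is sketched below.

```python
# Character error rate (CER): Levenshtein distance between OCR output and the
# ground truth, divided by the length of the ground truth.
def cer(reference: str, hypothesis: str) -> float:
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

print(cer("μῆνιν ἄειδε θεά", "μην ἄειδε θεα"))  # polytonic Greek example
```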

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Tue, 5 Oct 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

Asking questions on handwritten document collections

  • Authors: Minesh Mathew, Lluis Gomez, Dimosthenis Karatzas, CV Jawahar
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.00711
  • Pdf link: https://arxiv.org/pdf/2110.00711
  • Abstract
    This work addresses the problem of Question Answering (QA) on handwritten document collections. Unlike typical QA and Visual Question Answering (VQA) formulations where the answer is a short text, we aim to locate a document snippet where the answer lies. The proposed approach works without recognizing the text in the documents. We argue that the recognition-free approach is suitable for handwritten documents and historical collections where robust text recognition is often difficult. At the same time, for human users, document image snippets containing answers act as a valid alternative to textual answers. The proposed approach uses an off-the-shelf deep embedding network which can project both textual words and word images into a common sub-space. This embedding bridges the textual and visual domains and helps us retrieve document snippets that potentially answer a question. We evaluate results of the proposed approach on two new datasets: (i) HW-SQuAD: a synthetic, handwritten document image counterpart of SQuAD1.0 dataset and (ii) BenthamQA: a smaller set of QA pairs defined on documents from the popular Bentham manuscripts collection. We also present a thorough analysis of the proposed recognition-free approach compared to a recognition-based approach which uses text recognized from the images using an OCR. Datasets presented in this work are available to download at docvqa.org
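
A minimal sketch of the recognition-free retrieval step: assuming the textual query word and the word images have already been projected into a shared embedding space by some pretrained network (not shown here), snippets are ranked by cosine similarity to the query. The embeddings below are random placeholders.

```python
# Rank word-image snippets by cosine similarity to a textual query in a shared
# embedding space. Embeddings are assumed to come from a pretrained projector.
import numpy as np

def retrieve_snippets(query_emb, word_image_embs, top_k=5):
    """Return indices of the word images closest to the textual query."""
    q = query_emb / (np.linalg.norm(query_emb) + 1e-9)
    w = word_image_embs / (np.linalg.norm(word_image_embs, axis=1, keepdims=True) + 1e-9)
    return np.argsort(-(w @ q))[:top_k]

rng = np.random.default_rng(2)
print(retrieve_snippets(rng.normal(size=128), rng.normal(size=(1000, 128))))
```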

Keyword: OCR

Asking questions on handwritten document collections

  • Authors: Minesh Mathew, Lluis Gomez, Dimosthenis Karatzas, CV Jawahar
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2110.00711
  • Pdf link: https://arxiv.org/pdf/2110.00711
  • Abstract
    This work addresses the problem of Question Answering (QA) on handwritten document collections. Unlike typical QA and Visual Question Answering (VQA) formulations where the answer is a short text, we aim to locate a document snippet where the answer lies. The proposed approach works without recognizing the text in the documents. We argue that the recognition-free approach is suitable for handwritten documents and historical collections where robust text recognition is often difficult. At the same time, for human users, document image snippets containing answers act as a valid alternative to textual answers. The proposed approach uses an off-the-shelf deep embedding network which can project both textual words and word images into a common sub-space. This embedding bridges the textual and visual domains and helps us retrieve document snippets that potentially answer a question. We evaluate results of the proposed approach on two new datasets: (i) HW-SQuAD: a synthetic, handwritten document image counterpart of SQuAD1.0 dataset and (ii) BenthamQA: a smaller set of QA pairs defined on documents from the popular Bentham manuscripts collection. We also present a thorough analysis of the proposed recognition-free approach compared to a recognition-based approach which uses text recognized from the images using an OCR. Datasets presented in this work are available to download at docvqa.org

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Thu, 30 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

Context based Roman-Urdu to Urdu Script Transliteration System

  • Authors: H Muhammad Shakeel, Rashid Khan, Muhammad Waheed
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2109.14197
  • Pdf link: https://arxiv.org/pdf/2109.14197
  • Abstract
    Computers are now essential to daily life and useful in many fields such as search engines, text processing, short messaging services, voice chat, and text recognition. Over the years, many tools and techniques have been developed to support writing in different scripts. Many Asian languages such as Arabic, Urdu, Persian, Chinese, and Korean are often written in Roman characters, since the Roman alphabet is the most common choice for transliterating languages with non-Latin scripts. Several keyboard layouts already exist for entering Urdu characters, yet most Urdu speakers prefer Roman-Urdu in their applications because they are unfamiliar with the Urdu keyboard. The objective of this work is to improve context-based transliteration from Roman-Urdu to Urdu script. In this paper, we propose an algorithm that effectively addresses the transliteration issues. The algorithm works as follows: each Roman-encoded word is converted into the standard Urdu script and matched against a lexicon; if a match is found, the word is displayed in the text editor, and if more than one match is found, the highest-frequency word is displayed. If no match is found, the first encoded and converted instance is displayed and set as the default, and ambiguous words are then adjusted to their desired position according to their context. The results demonstrate the efficiency and significance of this algorithm compared with other models and algorithms for context-based Roman-Urdu to Urdu transliteration.
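
A toy version of the lexicon-lookup step the abstract describes, with the context-based disambiguation step omitted; the lexicon entries, frequencies, and fallback conversion below are invented for illustration and are not the paper's data or code.

```python
# Each Roman-Urdu token maps to one or more Urdu candidates with frequencies;
# the most frequent candidate wins, and unmatched tokens fall back to a direct
# character conversion (stubbed here).
LEXICON = {
    "kitab": [("کتاب", 120)],
    "mein":  [("میں", 900), ("مین", 12)],   # ambiguous token: pick the more frequent
}

def fallback_convert(token: str) -> str:
    # Placeholder for a rule-based character conversion of unseen tokens.
    return f"<{token}>"

def transliterate(sentence: str) -> str:
    out = []
    for token in sentence.lower().split():
        candidates = LEXICON.get(token)
        if candidates:
            out.append(max(candidates, key=lambda c: c[1])[0])  # highest frequency
        else:
            out.append(fallback_convert(token))
    return " ".join(out)

print(transliterate("mein kitab parhta"))
```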

Keyword: OCR

There is no result

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Thu, 7 Oct 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

There is no result

Keyword: Handwriting

Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models

  • Authors: Jen-Hao Rick Chang, Ashish Shrivastava, Hema Swetha Koppula, Xiaoshuai Zhang, Oncel Tuzel
  • Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2110.02891
  • Pdf link: https://arxiv.org/pdf/2110.02891
  • Abstract
    Controllable generative sequence models with the capability to extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting written handwriting, and generating missing training samples for downstream recognition tasks. However, typical training algorithms for these controllable sequence generative models suffer from the training-inference mismatch, where the same sample is used as content and style input during training but different samples are given during inference. In this paper, we tackle the training-inference mismatch encountered during unsupervised learning of controllable generative sequence models. By introducing a style transformation module that we call style equalization, we enable training using different content and style samples and thereby mitigate the training-inference mismatch. To demonstrate its generality, we applied style equalization to text-to-speech and text-to-handwriting synthesis on three datasets. Our models achieve state-of-the-art style replication with a similar mean style opinion score as the real data. Moreover, the proposed method enables style interpolation between sequences and generates novel styles.

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Mon, 4 Oct 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

There is no result

Keyword: OCR

DiVRsify: Break the Cycle and Develop VR for Everyone

  • Authors: Tabitha C. Peck, Kyla McMullen, John Quarles
  • Subjects: Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/2110.00497
  • Pdf link: https://arxiv.org/pdf/2110.00497
  • Abstract
    Virtual reality technology is biased. It excludes approximately 95% of the world's population by being primarily designed for male, western, educated, industrial, rich, and democratic populations. This bias may be due to the lack of diversity in virtual reality researchers, research participants, developers, and end users, fueling a noninclusive research, development, and usability cycle. The objective of this paper is to highlight how little virtual reality research involves understudied populations with respect to dimensions of diversity, such as gender, race, culture, ethnicity, age, disability, and neurodivergence. Specifically, we highlight numerous differences in virtual reality usability between underrepresented groups and commonly studied populations. These differences illustrate the lack of generalizability of prior virtual reality research. Lastly, we present a call to action with the aim that, over time, it will break the cycle and enable virtual reality for everyone.

Keyword: Handwriting

There is no result

Keyword: Scene Text

There is no result

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result

New submissions for Fri, 10 Sep 21

Keyword: Text Detection

There is no result

Keyword: Text Recognition

PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition

  • Authors: Zhi Qiao, Yu Zhou, Jin Wei, Wei Wang, Yuan Zhang, Ning Jiang, Hongbin Wang, Weiping Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.04145
  • Pdf link: https://arxiv.org/pdf/2109.04145
  • Abstract
    Nowadays, scene text recognition has attracted more and more attention due to its various applications. Most state-of-the-art methods adopt an encoder-decoder framework with attention mechanism, which generates text autoregressively from left to right. Despite the convincing performance, the speed is limited because of the one-by-one decoding strategy. As opposed to autoregressive models, non-autoregressive models predict the results in parallel with a much shorter inference time, but the accuracy falls behind the autoregressive counterpart considerably. In this paper, we propose a Parallel, Iterative and Mimicking Network (PIMNet) to balance accuracy and efficiency. Specifically, PIMNet adopts a parallel attention mechanism to predict the text faster and an iterative generation mechanism to make the predictions more accurate. In each iteration, the context information is fully explored. To improve learning of the hidden layer, we exploit the mimicking learning in the training phase, where an additional autoregressive decoder is adopted and the parallel decoder mimics the autoregressive decoder with fitting outputs of the hidden layer. With the shared backbone between the two decoders, the proposed PIMNet can be trained end-to-end without pre-training. During inference, the branch of the autoregressive decoder is removed for a faster speed. Extensive experiments on public benchmarks demonstrate the effectiveness and efficiency of PIMNet. Our code will be available at https://github.com/Pay20Y/PIMNet.
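
As a schematic of parallel decoding with iterative refinement (not PIMNet itself), the loop below predicts every character position at once, keeps the positions the model is most confident about, and re-predicts the rest in the next iteration; the "model" is a random stub standing in for the real network, and all sizes and thresholds are illustrative.

```python
# Parallel prediction with iterative refinement over a fixed-length sequence.
import numpy as np

rng = np.random.default_rng(3)
VOCAB, MAX_LEN, ITERS, KEEP = 40, 25, 3, 0.7   # illustrative sizes/thresholds

def stub_model(tokens):
    """Stand-in for the parallel decoder: returns per-position probabilities."""
    logits = rng.normal(size=(MAX_LEN, VOCAB))
    return np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

tokens = np.full(MAX_LEN, -1)                   # -1 marks an undecided position
for _ in range(ITERS):
    probs = stub_model(tokens)
    best = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    undecided = tokens == -1
    accept = undecided & (conf >= np.quantile(conf, 1 - KEEP))  # keep confident ones
    tokens[accept] = best[accept]
tokens[tokens == -1] = stub_model(tokens).argmax(axis=1)[tokens == -1]  # finalize the rest
print(tokens)
```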

Keyword: OCR

There is no result

Keyword: Handwriting

There is no result

Keyword: Scene Text

PIMNet: A Parallel, Iterative and Mimicking Network for Scene Text Recognition

  • Authors: Zhi Qiao, Yu Zhou, Jin Wei, Wei Wang, Yuan Zhang, Ning Jiang, Hongbin Wang, Weiping Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2109.04145
  • Pdf link: https://arxiv.org/pdf/2109.04145
  • Abstract
    Nowadays, scene text recognition has attracted more and more attention due to its various applications. Most state-of-the-art methods adopt an encoder-decoder framework with attention mechanism, which generates text autoregressively from left to right. Despite the convincing performance, the speed is limited because of the one-by-one decoding strategy. As opposed to autoregressive models, non-autoregressive models predict the results in parallel with a much shorter inference time, but the accuracy falls behind the autoregressive counterpart considerably. In this paper, we propose a Parallel, Iterative and Mimicking Network (PIMNet) to balance accuracy and efficiency. Specifically, PIMNet adopts a parallel attention mechanism to predict the text faster and an iterative generation mechanism to make the predictions more accurate. In each iteration, the context information is fully explored. To improve learning of the hidden layer, we exploit the mimicking learning in the training phase, where an additional autoregressive decoder is adopted and the parallel decoder mimics the autoregressive decoder with fitting outputs of the hidden layer. With the shared backbone between the two decoders, the proposed PIMNet can be trained end-to-end without pre-training. During inference, the branch of the autoregressive decoder is removed for a faster speed. Extensive experiments on public benchmarks demonstrate the effectiveness and efficiency of PIMNet. Our code will be available at https://github.com/Pay20Y/PIMNet.

Keyword: Text Segmentation

There is no result

Keyword: TextSpotter

There is no result

Keyword: Text Spotting

There is no result
