Hybrid Approach to Document Anomaly Detection: An Application to Facilitate RPA in Title Insurance

###### Author Bio: Abhijit Guha received the B. Sc. degree (Chemistry Honors) from Calcutta University, India in 2006, and the MCA (Master of Computer Applications) degree from Academy of Technology under West Bengal University of Technology, India in 2009. He is a Ph. D. degree candidate in the Department of Data Science, CHRIST (Deemed to be University), India. Presently, he is working as a research and development scientist at First American India Private Limited. His research areas include document image processing, data mining, statistical modeling, and machine learning modelling in the title insurance domain. He has delivered multiple business solutions using AI technologies and received three consecutive “Innovation of the year” awards from 2015 to 2017 from First American India for his research contributions. His research interests include artificial intelligence, natural language processing, text mining, statistical learning and machine learning. E-mail: abhijitguha.research@gmail.com (Corresponding author) ORCID iD: 0000-0002-3280-5730

Debabrata Samanta received the B. Sc. degree (Physics Honors) from Calcutta University, India in 2007, the MCA degree from Academy of Technology under West Bengal University of Technology, India in 2010, and the Ph. D. degree in computer science and engineering from National Institute of Technology, India in 2014. In 2015, he was a faculty member at Dayananda Sagar University, India and in 2019 he was at CHRIST (Deemed to be University), India. Currently, he is an assistant professor in the Department of Computer Science at CHRIST (Deemed to be University), India. He is a professional IEEE member, an associate life member of the Computer Society of India (CSI) and a life member of the Indian Society for Technical Education (ISTE). He has authored and coauthored over 127 papers in SCI/Scopus/Springer/Elsevier journals and IEEE/Springer/Elsevier conference proceedings in the areas of artificial intelligence, natural language processing and image processing. He received the “Scholastic Award” at the 2nd International Conference on Computer Science and IT Application, CSIT-2011, India. He has published 9 books, available for sale on Amazon and Flipkart, and has edited 1 book available on Google Books. He has authored and coauthored 2 Elsevier and 5 Springer book chapters. He has served as a convener, keynote speaker, and technical programme committee (TPC) member at various conferences and workshops, and has been an invited speaker at several institutions. His research interests include artificial intelligence, natural language processing and image processing. E-mail: debabrata.samanta369@gmail.com ORCID iD: 0000-0003-4118-2480
Figure  1.  Typical ADMS flow

Figure  2.  Stages of experiment

Figure  3.  Count distribution of document types

Figure  4.  Projection of documents using TF-IDF feature space on a 2D plane using t-SNE. Color versions of the figures in this paper are available online.

Figure  5.  Architecture 2: Engineered TF-IDF features using autoencoder used by OSVM for anomaly detection

Figure  7.  Projection of documents using Doc2Vec embedding on a 2D plane using t-SNE

Figure  6.  Architecture 2: A generic architecture of an autoencoder

Figure  8.  Architecture 1: TF-IDF and Doc2Vec features directly used by O-SVM for anomaly detection

Figure  9.  Change of F1 and Recall with changing values of ν

Figure  10.  Autoencoder training and validation loss over the epochs

Figure  11.  Reconstruction loss distribution grouped by classes

Figure  12.  PCA and F1 score with respect to ν. Kernel function: RBF

Figure  13.  PCA and F1 score with respect to ν. Kernel function: polynomial.

Figure  14.  Training time (TT) and Average inference time (AIT) comparison for hybrid and traditional model

## Hybrid Approach to Document Anomaly Detection: An Application to Facilitate RPA in Title Insurance

### English Abstract

