Paper: Towards Users Clustering by Analyzing Web Application Log Files through the Utilization of Spark
Abstract: Many data mining algorithms must now handle very large data sets. Building on distributed computing, we propose a method for clustering users by analyzing large numbers of web application log files; the method is integrated with semantic access information. The process includes data preprocessing and a data cleaning or merging algorithm. From the web application logs it extracts each user's access time, click count, preferred content, and similar attributes. Compared with standalone tools, it scales for log analysis through batch processing and in-memory computing. A program for processing web application log data is developed using Spark. It demonstrates Spark's strong performance in data processing and validates the efficiency and practicality of the method. Experimental results show that, in a Spark cluster computing environment, the method processes large numbers of log files effectively and clearly improves the efficiency of data mining.
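The feature extraction described in the abstract (access time, click count, preferred content per user) can be illustrated with a minimal sketch. This is not the authors' program: it is plain single-machine Python rather than Spark, and the tab-separated log format, the sample records, and the names `parse_line` and `user_features` are all assumptions made for illustration. In a Spark job, the parsing and filtering would roughly correspond to `map` and `filter` over the log lines, and the per-user aggregation to a keyed reduction such as `reduceByKey`.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical log format (an assumption, not taken from the paper):
#   user_id <TAB> ISO-8601 timestamp <TAB> requested path
SAMPLE_LOG = [
    "u1\t2015-03-01T09:15:00\t/news/sports",
    "u1\t2015-03-01T09:17:30\t/news/world",
    "u1\t2015-03-01T21:02:10\t/shop/cart",
    "u2\t2015-03-01T22:40:00\t/shop/cart",
    "malformed line without tabs",
    "u2\t2015-03-01T22:55:45\t/shop/items",
]


def parse_line(line):
    """Cleaning step: return (user, hour, category), or None for a bad record."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None
    user, stamp, path = parts
    try:
        hour = datetime.fromisoformat(stamp).hour
    except ValueError:
        return None
    # Use the first path segment as a coarse content category.
    segments = [s for s in path.split("/") if s]
    category = segments[0] if segments else "(root)"
    return user, hour, category


def user_features(lines):
    """Aggregate per-user click count, most frequent hour, and top category."""
    clicks = Counter()
    hours = defaultdict(Counter)
    categories = defaultdict(Counter)
    for record in map(parse_line, lines):
        if record is None:  # drop records rejected by the cleaning step
            continue
        user, hour, category = record
        clicks[user] += 1
        hours[user][hour] += 1
        categories[user][category] += 1
    return {
        user: {
            "clicks": clicks[user],
            "peak_hour": hours[user].most_common(1)[0][0],
            "preferred": categories[user].most_common(1)[0][0],
        }
        for user in clicks
    }


features = user_features(SAMPLE_LOG)
for user, feats in sorted(features.items()):
    print(user, feats)
```

The resulting per-user records are the kind of feature vectors a clustering step could consume; the clustering itself is not shown here.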