Project File Details


File Type: MS Word (DOC) & PDF
File Size: 1,071KB
Number of Pages: 77

ABSTRACT

Document clustering is an automatic, unsupervised machine learning technique that aims at grouping related items into clusters or subsets. The target is to create clusters with high internal coherence that differ substantially from one another. Put simply, items within the same cluster should be highly similar while remaining highly dissimilar to items in other clusters. Automatic clustering of documents has played a very significant role in many fields, including data mining and information retrieval. This thesis aims to improve the overall efficiency of a document clustering technique using N-grams and an efficient similarity measure, and in doing so improves the purity and accuracy of the obtained clusters. The preprocessing method is based on N-grams (sequences of N consecutive characters), which give no special consideration to stop-words or punctuation but create an overlap among the contents of a document. This overlap makes the representation tolerant of errors, thereby increasing the quality of the clusters to a great extent. The approach clusters news articles based on their N-gram representation, thereby reducing noise and increasing the probability of occurrence of the sequences within the articles. The proposed clustering technique has parameters that can be adjusted at the document representation level in order to improve the efficiency and quality of the generated clusters. Experiments carried out in the R programming environment on the real datasets Reuters-21578 and 20 Newsgroups demonstrated the effectiveness of the proposed clustering technique at different levels of N-grams in terms of the accuracy and purity of the generated clusters. The results also showed that the proposed clustering technique performs better on average than the baseline technique in terms of both accuracy and purity, with the best results obtained when the N-gram window is N = 3.

TABLE OF CONTENTS

Cover Page
Title Page
Declaration
Certification
Acknowledgement
Abstract
Table Of Contents
List Of Tables
List Of Figures
List Of Abbreviations
CHAPTER ONE
INTRODUCTION
1.1 Background Of The Study
1.1.1 Motivation For The Study
1.2 Data Mining
1.3 Clustering
1.4 Applications Of Clustering
1.5 Problem Statement
1.6 Aim And Objectives
1.6.1 Aim
1.6.2 Objectives
1.7 Significance Of The Study
1.8 Scope And Limitation Of The Study
1.9 Thesis Organization
CHAPTER TWO
LITERATURE REVIEW
2.1 Basic Terminologies And Concepts
2.2 Textual Document Representation Methods
2.2.1 Word-Based Representation
2.2.2 Term-Based Representation
2.2.3 N-Grams-Based Representation
2.3 Textual Document Pre-Processing Methods
2.3.1 Tokenization
2.3.2 Lowercase Conversion
2.3.3 Punctuation Removal
2.3.4 Term Filtering (Stop Words Removal)
2.3.5 Alphanumeric And Short Length Words Removal
2.3.6 Lemmatization
2.3.7 Stemming
2.4 Textual Document Similarity Measures
2.4.1 Metric Conditions
2.4.2 Euclidean Distance Measure
2.4.3 Cosine Distance Measure
2.4.4 Jaccard Distance Measure
2.4.5 Manhattan Distance Measure
2.4.6 Pearson Correlation Measure
2.5 Categories Of Clustering Algorithms
2.5.1 Hierarchical (Representative-Based) Clustering
2.5.1.1 Agglomerative (Bottom-Up) Clustering
2.5.1.2 Divisive (Top-Down) Clustering
2.5.2 Partition-Based Clustering
2.5.2.1 K-Means Algorithm
2.5.2.2 K-Medoids Algorithm
2.6 Review Of Literature
CHAPTER THREE
METHODOLOGY
3.1 Introduction
3.2 Proposed Methodology (Clustering Technique)
3.2.1 Example 3.1
3.3 News Articles Pre-Processing
3.4 News Articles N-Grams Representation
3.4.1 Representation With 1-Gram
3.4.2 Representation With N-Grams
3.4.3 Weight Normalization
3.5 Vector Space Model
3.6 Dimensionality Reduction On Vector Features
3.7 Improved Sqrt-Cosine Similarity Measure
3.8 K-Means Clustering
3.9 Evaluation Methods
CHAPTER FOUR
RESULTS AND DISCUSSIONS
4.1 Description Of Datasets
4.2 Evaluation Of Experiments
4.2.1 Experimental Set-Up
4.2.2 Results
4.2.2.1 Result Of Technique With N-Gram Equal To 1 (N=1)
4.2.2.2 Result Of Technique With N-Grams Equal To 2 (N=2)
4.2.2.3 Result Of Technique With N-Grams Equal To 3 (N=3)
4.2.2.4 Result Of Technique With N-Grams Equal To 4 (N=4)
4.2.2.5 Results Comparison Of Proposed Technique And Baseline Technique
4.3 Our Contributions
CHAPTER FIVE
SUMMARY, CONCLUSION, AND FUTURE WORK
5.1 Summary
5.2 Conclusion
5.3 Future Work
List Of Publications
Appendix
Coding

CHAPTER ONE

INTRODUCTION
This chapter introduces the background of the study, the motivation for the study, data mining, clustering and its applications, the problem statement, the aim and objectives of the study, the significance of the study, the scope and limitation of the study, and finally the organization of the thesis.
1.1 Background of the Study
The world we live in is full of data. Computers have been accepted as the best means of data storage because data can be saved easily and conveniently, anybody with access to a computer can do so, and, more importantly, many users can share stored information or send it to different locations (Kriegel et al., 2007). As the number of text documents stored in large databases increases, understanding the hidden patterns or relationships in the data becomes a huge challenge. Text data, not being in numerical format, can hardly be analyzed directly using statistical methods. Information overload, or drowning in data, is a common complaint: people see the potential value of information, yet are frustrated by their inability to derive benefit from it because of its volume and complexity (Sowjanya & Shashi, 2010; Han et al., 2012).
Online news articles, journals, books, research papers, and web pages grow rapidly every day, so the need has arisen to quickly find the most important, interesting, valuable, or entertaining items. This is because we are overwhelmed by the increasing volume of information made available online (Bouras & Tsogkas, 2016; Rupnik et al., 2016). Throughout history, humans have used information to achieve great things, such as predicting the future to avoid disaster and making vital decisions (Butler & Keselj, 2009; Jatowt & Au Yeung, 2011; Bouras & Tsogkas, 2016). Overloading the Internet with this huge amount of information makes searching very tedious for users, so there is an enormous demand for techniques that efficiently and effectively derive useful knowledge from this diverse, unstructured information (Bouras & Tsogkas, 2013; Popovici et al., 2014; Lwin & Aye, 2017).
One of the most important means of dealing with data is classifying or grouping it into clusters or categories. Classification has played an important and indispensable role throughout human history (Wu et al., 2008). There are two types of classification: supervised and unsupervised. Supervised classification requires predefined knowledge, whereas unsupervised classification, sometimes referred to as clustering or exploratory data analysis, requires no predefined labeled data (Agrawal et al., 1998; Tao et al., 2004).
Grouping similar data, such as news articles, based on their characteristics is an important issue. Grouping can be done on the basis of a similarity measure. Several similarity measures (such as Gauging, Jaccard, Euclidean, Edit, and Cosine) have been proposed and applied in computing the similarity between two different textual documents based on character matching, word semantics, and word sense (Damashek, 1995; Huang, 2008; Qiujun, 2010; Svadasa & Jhab, 2014; Akinwale & Niewiadomski, 2015; Sonia, 2016; Huang et al., 2017). The rationale behind each method of measuring the similarity between two textual documents is the continuing quest to improve the quality and effectiveness of existing clustering or filtering techniques (Shah & Mahajan, 2012; Sonia, 2016; Singh et al., 2017).
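To make the idea of a similarity measure concrete, the following minimal sketch computes three of the measures named above (Cosine, Jaccard, and Euclidean) on simple term-count vectors. It is an illustration only: the function names, the whitespace tokenization, and the two sample headlines are assumptions made for this example, and the sketch is written in Python rather than the R environment used in the thesis.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase a text and count its whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a, b):
    """Number of shared distinct terms divided by all distinct terms."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two term-count vectors."""
    # A Counter returns 0 for a missing term, so absent terms count as zero.
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


# Two made-up headlines used purely for illustration.
doc1 = term_counts("stocks rise as markets rally")
doc2 = term_counts("markets rally as oil prices rise")

print("cosine   :", cosine_similarity(doc1, doc2))
print("jaccard  :", jaccard_similarity(doc1, doc2))
print("euclidean:", euclidean_distance(doc1, doc2))
```

Cosine and Jaccard are similarities (higher means more alike), while Euclidean is a distance (lower means more alike), so the values are not directly comparable with one another.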
1.1.1 Motivation for the Study
This research is motivated by the following challenges, observed while going through an online news portal, which call for an efficient clustering technique:
i. The available news articles were large in number.
ii. New articles were added online in large numbers every day.
iii. Different sources contributed articles covering the same news story.
iv. Recommendations had to be updated in real time.
With an efficient clustering technique, the search domain for recommendation can be reduced, because most users are interested in news that belongs to a limited number of clusters. Time efficiency would thereby be improved.
The research is mainly motivated by the possibility of improving the effectiveness of text document clustering techniques by identifying why the existing techniques (algorithms) are ineffective and finding solutions.
1.2 Data Mining
Data mining is a field that deals with structured and unstructured data in order to derive previously unknown knowledge or meaningful information using machine learning algorithms (Shah & Mahajan, 2012; Svadasa & Jhab, 2014; Allahyari et al., 2017). It has been applied to textual documents to predict and group related items in order to create a better and clearer understanding of such items (Grineva et al., 2009; Rothe & Schütze, 2017; Wei et al., 2017). The process of discovering and extracting nontrivial, interesting, and previously unknown knowledge or patterns from unstructured text documents is referred to as text mining (Agrawal et al., 1998; Singh et al., 2017; Lin et al., 2017). Text mining, also known as Knowledge Discovery from Text (KDT), is the process of extracting high-quality information from text (i.e. structured, such as RDBMS data; semi-structured, such as XML; and unstructured, such as documents containing words, videos, and images). It covers a large set of related topics and algorithms for analyzing text, spanning various bodies of knowledge in computer science, including information retrieval, natural language processing, data mining, and machine learning, as well as many application domains such as the web and biomedical sciences (Allahyari et al., 2017).
1.3 Clustering
Clustering has been regarded as one of the most popular algorithms in data mining and has been extensively studied in relation to text (Wu et al., 2008; Reddy, 2017). Clustering has wide-ranging applications in classification (Wu et al., 2008; Allahyari et al., 2017), data visualization (Ferreira et al., 2013), and organizing documents (Slonim & Tishby, 2000; Issal & Ebbesson, 2010). Clustering, in general, is an unsupervised data mining technique that groups highly related objects into the same class while maintaining high dissimilarity with other classes (Miao et al., 2005; Parapar & Barreiro, 2009; Shah & Mahajan, 2012; Bouras & Tsogkas, 2013). The similarity is computed using a particular similarity measure. Text clustering can operate at different levels of granularity, where the clustered units can be any of these document segments: paragraphs, sentences, or even terms. It is one of the main techniques used for organizing documents to enhance retrieval and support browsing; for example, Cutting et al. (2017) applied clustering to produce a table of contents for a large collection of documents. Text clustering is basically the application of a data mining functionality, clustering analysis, to textual documents (Svadasa & Jhab, 2014). Clustering analysis has also been applied in numerous areas of life to solve problems such as financial forecasting (Butler & Keselj, 2009), predicting future expectations (Jatowt & Au Yeung, 2011), and grouping related news articles or textual documents (Bouras & Tsogkas, 2013; Rupnik et al., 2016).
Clustering has sometimes been referred to as automatic classification (Jajoo, 2008; Bharti & Babu, 2017); however, this is inaccurate because the clusters are not known prior to processing, whereas in classification the classes are pre-defined. In clustering, the nature and distribution of the data determine cluster membership. In classification, by contrast, the classifier first learns the association between classes and objects from a so-called training set, i.e. a correctly labeled dataset, and then transfers the learned behavior to the unlabeled dataset. Figure 1.1 below shows an example of a dataset with a clear cluster structure C1, C2, C3, and C4 respectively:
Figure 1.1: An Example of a Clear Cluster Structure Dataset
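As a small illustration of how cluster membership follows from the distribution of the data alone, the sketch below runs a bare-bones k-means procedure (the algorithm named later in the scope of this study) on a toy two-dimensional dataset. The points, the value of k, the random seed, and the helper names are all assumptions made for this example; it uses two clusters rather than the four of Figure 1.1 and is not the implementation used in the thesis.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: returns (cluster index per point, final centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as initial centroids
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the procedure has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignments, centroids


# Two visibly separated groups; no labels are supplied to the algorithm.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
print(centroids)
```

No labeled training set is involved: the grouping emerges from the distances between the points, which is the distinction from classification drawn above.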
1.4 Applications of Clustering
Clustering has been regarded as the most common type of unsupervised learning and a major tool applicable in many fields of science and business. There are many areas where it has been applied to solve problems. The basic directions in which clustering is used have been summarized as follows (Witten, 2004; Ayadi et al., 2016; Allahyari et al., 2017):
i. Finding Similar Documents: This feature is often used when a user has spotted a single “good” document in a search result and wants more documents like it. The most interesting characteristic here is that clustering can discover documents that are conceptually alike, in contrast to search-based approaches, which can only discover whether documents share many of the same words.
ii. Organizing Large Document Collections: Document retrieval focuses on finding documents relevant to a particular query, but this does not solve the problem of making sense of a large number of uncategorized documents. The challenge is to organize these documents into a taxonomy identical to the one humans would create given enough time, and to use it as a browsing interface to the original collection of documents.
iii. Detection of Duplicate Content: Many applications need to find duplicates or near-duplicates in a large number of documents. Clustering has been applied in plagiarism detection, grouping related news articles, and reordering the ranking of search results (ensuring higher diversity among the topmost documents). A description of the clusters is rarely needed in such applications.
iv. Recommendation Systems: These applications recommend articles to a user based on the articles that the user has already read. Clustering of articles makes real-time recommendation possible and improves the quality of the system considerably, because only related articles are recommended to the user.
v. Search Optimization: Clustering greatly improves the efficiency and quality of search engines, because the user's query can first be compared to clusters instead of directly to documents, and search results can be arranged easily as well.
vi. Natural Language Processing (NLP): This is a sub-field of computer science, linguistics, and artificial intelligence that aims at using computers to understand natural language. Many text mining algorithms make extensive use of NLP techniques such as part-of-speech (POS) tagging, syntactic parsing, and other linguistic analyses.
vii. Text Summarization: Text mining applications in many fields need to summarize documents in order to obtain a concise overview of a large document or a collection of documents on a topic. Summarization techniques generally fall into two categories: extractive summarization, in which the summary comprises information units extracted from the original text, and, in contrast, abstractive summarization, in which the summary may contain “synthesized” information that is not in the original text.
1.5 Problem Statement
The degree of purity and accuracy of a clustering technique's results is one of the major challenges in text clustering. This is attributed to the basic tasks performed on the text, namely: document feature selection, document similarity measure selection, selecting an appropriate algorithm for clustering, the efficiency of the clustering algorithm in terms of Central Processing Unit (CPU) resources, and associating a useful conclusion with the final clusters (Shah & Mahajan, 2012; Bouras & Tsogkas, 2013; Svadasa & Jhab, 2014; Khabia & Chandak, 2015; Rupnik et al., 2016; Lwin & Aye, 2017; Huang et al., 2017). These tasks are explained as follows:
i. Document Feature Selection: This is the process that identifies the terms that have a positive impact on the clustering process.
ii. Document Similarity Measure Selection: This is the method of calculating the distance between two different documents, which contributes greatly to the quality of a cluster.
iii. Selecting an Appropriate Algorithm for Clustering: There are several algorithms for clustering, such as k-means and density-based algorithms. Choosing the best algorithm to apply is a challenging task.
iv. Clustering Algorithm Efficiency in Terms of CPU Resources: This measures how well the intended technique utilizes the available resources.
v. Associating a Useful Conclusion with the Final Clusters: This checks the level of information that can be derived from the final clusters.
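Purity, named at the start of this section as one of the two quality criteria, is commonly computed by taking the most frequent true class within each cluster, summing those majority counts, and dividing by the total number of documents. The sketch below follows that common definition; the evaluation method described in Chapter 3 of the thesis is not reproduced in this excerpt and may differ in detail, and the function name and sample labels are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")
    # Count how often each true label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts[cluster][label] += 1
    # Sum the size of the majority class of every cluster.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical result: six articles, two clusters, two true topics.
clusters = [0, 0, 0, 1, 1, 1]
topics = ["sport", "sport", "politics", "politics", "politics", "sport"]
print(purity(clusters, topics))  # 4 of the 6 articles match their cluster's majority topic
```

A purity of 1.0 means every cluster contains documents of a single class; note that purity alone rewards producing many small clusters, which is one reason it is usually reported together with a second criterion such as accuracy.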
Much research has been conducted using text mining approaches in the quest to improve the accuracy and purity of clustering techniques, and the results have been applied in areas such as search engine optimization and plagiarism detection (Shah & Mahajan, 2012). This research includes phrase-based clustering of news articles (Pera & Ng, 2007); enhancing news article clustering with word-based N-grams (Bouras & Tsogkas, 2013); and a multi-lingual news document similarity and event tracking system using the latent semantic indexing technique (Rupnik et al., 2016). Most techniques for textual document clustering have been designed with either traditional similarity measures, such as Jaccard and Cosine, or word-based document representation, which have proven to be less effective in terms of clustering accuracy and purity (Bouras & Tsogkas, 2016; Rupnik et al., 2016; Lwin & Aye, 2017; Huang et al., 2017; Sohangir & Wang, 2017a, 2017b).
This thesis proposes a technique for clustering news articles and other related textual documents using an efficient similarity measure known as the “improved sqrt-cosine similarity measure” and word-based N-grams. The thesis addresses the weaknesses of two existing clustering techniques: that of Bouras & Tsogkas (2016), which uses a traditional similarity measure with N-grams, and that of Sohangir & Wang (2017a, 2017b), who proposed an efficient similarity measure but did not test its suitability with the N-gram (sequence of N consecutive characters) data representation technique.
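The text above refers to N-grams both as word-based sequences and as sequences of N consecutive characters. The sketch below shows what each variant produces for a short string, so that the overlap between neighbouring N-grams is visible. Both helper functions and the sample strings are hypothetical and introduced only for illustration; the excerpt does not specify the exact tokenization used in the thesis.

```python
def char_ngrams(text, n):
    """All overlapping sequences of n consecutive characters."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """All overlapping sequences of n consecutive whitespace-separated words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(char_ngrams("news", 3))
# ['new', 'ews']
print(word_ngrams("oil prices rise sharply", 2))
# ['oil prices', 'prices rise', 'rise sharply']
```

Because consecutive N-grams share N-1 characters (or words), a single misspelling affects only a few of them, which is the error tolerance the abstract attributes to this representation. A text shorter than N yields an empty list.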
1.6 Aim and Objectives
1.6.1 Aim
The aim of this research is to enhance an existing clustering technique for news articles and other related textual documents.
1.6.2 Objectives
The major objectives of this research can be summarised as follows:
i. To improve the accuracy and purity of a technique for clustering news articles and other textual documents using an efficient similarity measure, N-grams, and k-means clustering algorithm.
ii. To compare the effectiveness of the enhanced clustering technique results with extant technique(s).
1.7 Significance of the Study
Clustering is one of the major machine learning techniques and has contributed positively, directly or indirectly, to almost all areas of human endeavor. The findings of this study will benefit society, especially the field of data mining and, specifically, the design of unsupervised machine learning algorithms (clustering techniques), considering the important role document categorization plays in the effective use of the internet today. The growing demand for techniques that efficiently group related documents, such as news articles on the internet, justifies the need for effective clustering approaches. Thus, news article management systems that apply the clustering approach recommended by this study will be able to perform better, and the users of such systems will experience a more accurate and satisfying service. For researchers, the study will help uncover critical areas in the clustering process that many researchers have not been able to explore fully. Thus, an enhanced clustering technique is attained.
1.8 Scope and Limitation of the Study
This research is limited to developing an enhanced clustering technique for news articles using an efficient similarity measure known as the “improved sqrt-cosine similarity measure” with N-gram-based data representation. Only the unsupervised machine learning algorithm k-means will be used for clustering. The technique can be applied in news article management systems and information retrieval systems in order to improve the quality of grouping related news articles or information for easy access by the users of such systems.
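This excerpt does not state the formula of the improved sqrt-cosine similarity measure. The sketch below assumes the form usually attributed to Sohangir & Wang (2017a), namely ISC(x, y) = sum_i sqrt(x_i * y_i) / (sqrt(sum_i x_i) * sqrt(sum_i y_i)) for non-negative feature weights; this formula, the function name, and the sample weights are assumptions, and the definition in Section 3.7 of the thesis should be taken as authoritative.

```python
import math


def improved_sqrt_cosine(x, y):
    """Assumed ISC form: sum of sqrt(x_i * y_i) over shared features,
    divided by sqrt(sum of x) * sqrt(sum of y). Weights must be non-negative."""
    shared = x.keys() & y.keys()  # features absent from either vector contribute 0
    numerator = sum(math.sqrt(x[f] * y[f]) for f in shared)
    denominator = math.sqrt(sum(x.values())) * math.sqrt(sum(y.values()))
    return numerator / denominator if denominator else 0.0


# Hypothetical N-gram weight vectors for two documents.
doc_a = {"new": 1, "ews": 3}
doc_b = {"new": 4, "wsp": 5}
print(improved_sqrt_cosine(doc_a, doc_b))
```

Under this form, a document compared with itself scores 1 and two documents with no shared features score 0, which matches the range expected of a cosine-style similarity.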
1.9 Thesis Organization
This thesis is organized into five (5) chapters. A brief outline of the remaining chapters follows:
Chapter 2 provides a literature review of news article clustering techniques, covering several algorithms, approaches, and methodologies that have been developed for clustering news articles. The chapter indicates how the literature has contributed to this area of research and which approaches have been used.
Chapter 3 explains the methodology applied in this thesis. It gives a theoretical overview of the methodology and of the evaluation measures used to compare its results.
Chapter 4 contains the results of the proposed clustering technique as implemented in the R programming environment, together with comparisons and discussions of the various results.
Chapter 5 contains the final summary, the conclusion, and future work that can build on the results and findings of this thesis.