ABSTRACT

 

Text analysis is a branch of data mining that deals with text documents. This project addresses the classification of texts into categories. The volume of both structured and unstructured data is rising sharply in this era, so the ability to classify such data is important. Classification begins with data collection, preprocessing, and feature extraction. Several techniques can be used for text classification; this project employs machine learning algorithms. Drawing on Natural Language Processing, it also shows why feature extraction and selection are needed.
In this research, we show how a computer can intelligently classify texts into their categories. The emphasis is on English-language word documents.
Keywords: Machine Learning (ML), text classification, MNB, KNN, SVM

 

TABLE OF CONTENTS

ACKNOWLEDGEMENT
Chapter 1
1.1 Introduction
1.2 Natural Language Processing (NLP)
1.3 Objectives of the Project
1.4 Problem Statement
1.5 Limitations of the Study
1.6 Chapter Organization
Chapter 2
2.1 Introduction to Machine Learning
2.1.1 Supervised Learning
2.1.2 Unsupervised Machine learning
2.1.3 Applications of Machine Learning
2.2 Literature Review
Chapter 3
3.1 Classification Steps
3.2 Project Requirements
3.3 Data collection and preparation
3.4 Preprocessing Data
3.5 Feature Extraction
3.5.1 Vectorization
3.5.2 Classification Technique
Chapter 4
4.1 Text Representation
4.2 Test Classification and Preprocessing
4.2.1 Precision
4.2.2 Recall
4.2.3 F1 score
4.2.4 Accuracy
4.3 Results
4.3.1 Multinomial Naïve Bayes (MNB)
4.3.2 Logistic Regression (LR)
4.3.3 Support Vector Machine (SVM)
4.3.4 K-Nearest Neighbor (KNN)
4.4 Comparison Result
Chapter 5
5.1 Conclusion
5.2 Challenges
5.3 Future Works

 

 

CHAPTER ONE

 

1.1 Introduction
Text analysis is a field that has been growing rapidly over the years. The idea of text analysis came into being around the year 1950 (“Text Analytics: A Primer | GreenBook,” 2017). Its main objective is to enable a descriptive view of the structure and content of a text document.
Text analysis is the act of understanding or deriving important information contained within a document or text. While structured data is generally managed using a database system, text data is typically managed using a search engine because it is unstructured. The share of unstructured data generated has been increasing rapidly; statistical records put the growth rate at about 55% to 65% each year (“Structured vs. Unstructured Data,” 2015). Analyzing text for sentiment is of great importance in this era. Being able to understand or predict the emotional tone or hidden messages in subjective text helps data analysts do their job.
Data analysis helps to obtain facts that can inform a company’s marketing strategy. Responses from individuals can be used to determine when products should be produced in bulk. Sentiment analysis can help in determining positive or negative views about a company (Saranya & Jayanthy, 2018). It can also help to determine which messages are confidential and which may be seen by the public.
Managing huge data sets is difficult, which is where text classification comes into play. Text classification is the act of classifying or arranging a large amount of generated data into different pre-defined categories as required. Without such classification, it would be difficult to keep accumulated data organized; proper arrangement of the data makes work easier. Even so, data classification itself requires a lot of work, hence the introduction of machine learning.
The purpose of using machine learning is to develop a suitable algorithm that classifies data intelligently and improves system performance using the training dataset. In other words, machine learning helps to solve the text classification problem. The classification steps to be followed are:
i. Data Collection: This involves accumulating data needed for the experiment.
ii. Text Preprocessing: Preparing the raw data for another process. It includes removing unwanted spacing, capitalization, etc.
iii. Feature Extraction or Selection: This involves selecting and reducing the number of randomized variables within the text.
iv. Text Representation: Applying supervised and unsupervised learning algorithms on the dataset.
v. Text Classification and Processing: Grouping of text into categories and using the training and test datasets to arrive at the result.
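The five steps above can be sketched end to end in a few lines of Python. This is only an illustrative sketch, not the project's implementation: the toy corpus, the category names, and the function names (`preprocess`, `extract_features`, `train`, `classify`) are assumptions made for the example, and the classifier is a hand-written multinomial Naive Bayes with add-one smoothing, chosen because MNB is one of the algorithms the project names.

```python
import math
import re
from collections import Counter, defaultdict

# Step i. Data collection: a tiny hand-made corpus (hypothetical, illustration only).
TRAIN = [
    ("The team won the football match", "sports"),
    ("The striker scored a late goal", "sports"),
    ("The bank raised interest rates", "finance"),
    ("Stock markets fell as rates rose", "finance"),
]

def preprocess(text):
    """Step ii. Lowercase, drop non-letters, and collapse extra spacing."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def extract_features(text):
    """Step iii. Bag-of-words counts over the cleaned tokens."""
    return Counter(preprocess(text).split())

def train(examples):
    """Step iv. Fit a multinomial Naive Bayes model (supervised learning)."""
    class_docs = Counter()              # number of documents per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocab = set()
    for text, label in examples:
        feats = extract_features(text)
        class_docs[label] += 1
        word_counts[label].update(feats)
        vocab.update(feats)
    return class_docs, word_counts, vocab

def classify(model, text):
    """Step v. Return the category with the highest log-probability."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word, count in extract_features(text).items():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(classify(model, "Interest rates and the stock markets"))
print(classify(model, "A late goal won the match"))
```

With a real dataset, the same structure would apply, but the corpus would be split into separate training and test sets before step v.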
1.2 Natural Language Processing (NLP)
This is a significant research area that deals with ways of enabling computers to understand and interpret human language, which computers do not ordinarily understand. NLP is based on deep learning, which is a research area in machine learning (Nene, 2017). It has been used in various fields of study such as robotics and electrical and computer engineering (Vijayarani & Ilamathi, 2018). It comprises different algorithms that can be used to collect data on how human beings understand and use language and to transform unstructured data into a form the computer can understand. Acting as the middleman between computers and humans, it can be applied in sentiment analysis and in areas such as speech recognition, stemming, artificial intelligence, and text summarization. It helps to break down the structure of text and obtain essential information for the computer.
When text or data is provided, the computer uses algorithms to mine information from each sentence in the dataset and store the important data (“An easy introduction to Natural Language Processing,” 2015). Sometimes the computer fails to properly understand the meaning of a sentence, leading to obscure results such as those seen in the machine translation attempts between English and Russian in the 1950s. Such failures underline the fact that natural language is hard for computers to discern.
1.3 Objectives of the Project
The aim of this project is to classify text according to its topic using supervised machine learning algorithms. It has been pointed out that applying machine learning algorithms can help in achieving the desired result. Further, we aim to compare the results obtained using the proposed model with the results obtained using other related models in terms of accuracy.
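As a rough illustration of comparing models in terms of accuracy, the sketch below computes accuracy as the fraction of test documents whose predicted category matches the true one. The labels, the predictions, and the names `model_a` and `model_b` are hypothetical placeholders, not results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical true categories for five test documents.
y_true = ["sports", "finance", "sports", "finance", "sports"]

# Hypothetical predictions from two models evaluated on the same test set.
predictions = {
    "model_a": ["sports", "finance", "finance", "finance", "sports"],
    "model_b": ["sports", "sports", "finance", "finance", "sports"],
}

scores = {name: accuracy(y_true, preds) for name, preds in predictions.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print("best:", best)
```

Accuracy alone can be misleading when categories are unevenly represented, which is why precision, recall, and F1 score are usually reported alongside it.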
1.4 Problem Statement
As large datasets have become available over the internet, classification of these datasets is becoming increasingly important. Further, several past projects could not find a way to perform data categorization effectively despite the different types of features that were used (Saranya & Jayanthy, 2018). N-gram features (unigram and bigram) have often been used. In this project, we focus on the use of the 4-gram feature in the data classification performed by a text analyzer, with the end goal of achieving maximum accuracy.
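To make the n-gram terminology concrete, the sketch below extracts unigrams, bigrams, and 4-grams from a sentence. It assumes word-level n-grams and simple whitespace tokenization; the source does not say whether the project uses word-level or character-level n-grams, and the function name `word_ngrams` is chosen only for this example.

```python
def word_ngrams(text, n):
    """Return the word-level n-grams in `text` as a list of tuples."""
    tokens = text.lower().split()
    # A sliding window of width n; yields nothing if the text is shorter than n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "text classification with machine learning algorithms"

print(word_ngrams(sentence, 1))  # unigrams
print(word_ngrams(sentence, 2))  # bigrams
print(word_ngrams(sentence, 4))  # 4-grams
```

Larger values of n capture more context per feature but produce far sparser feature sets, so a 4-gram model typically needs more training data than a unigram or bigram model.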
1.5 Limitations of the Study
This research is limited to the implementation of machine learning algorithms on huge datasets in a big data platform. The platform intended for use was Hortonworks; however, due to space constraints and an unreliable power supply, it was not used.
1.6 Chapter Organization
This project covers the steps and results of text classification and analysis. It comprises five (5) chapters, each discussing a distinct topic. Below is a brief description of the chapters:
i. Chapter 1 introduces text analysis, sentiment analysis, and natural language processing, and states the aims and limitations of the project.
ii. Chapter 2 focuses on the machine learning concepts to be used, the classification of datasets, and a comparison with previous works/projects.
iii. Chapter 3 presents the materials used to accomplish this project, the methodology, its setup, and a demonstration of how the classification will be implemented.
iv. Chapter 4 discusses the implementation of the work and the results obtained.
v. Chapter 5 presents the conclusion of the project.
