
ABSTRACT

Access to the Internet is becoming more affordable, especially in Africa, and with this the number of active social media users is on the rise. Twitter is a social media platform on which users post and interact with messages known as “tweets”. These tweets are short, with a limit of 280 characters. With over 100 million Internet users and 6 million active monthly users in Nigeria, a large volume of data is generated through this medium daily. This thesis aims to gain insights from the ever-growing body of Nigerian data generated on Twitter using topic modelling. We apply Latent Dirichlet Allocation (LDA) to Nigerian health tweets from verified accounts covering the period 2015–2019 to derive the top health topics in Nigeria. We detected the outbreaks of Ebola, Lassa fever and meningitis within this time frame, as well as recurring topics of child immunization/vaccination. Twitter data contains useful information that can give insights to individuals, organizations and the government; hence it should be further explored and utilized.

 

TABLE OF CONTENTS

 

List of Figures and Tables
1 INTRODUCTION
1.1 Introduction
1.2 Data Mining
1.2.1 Data collection
1.2.2 Feature extraction and data cleaning
1.2.3 Analytical processing and algorithms
1.3 Text Mining
1.4 Topic Modelling
1.5 Applications of Topic Modelling
1.6 Problem Statement
1.7 Aim and Objectives
1.8 Methodology
2 LITERATURE REVIEW
2.1 Introduction
2.2 Basic Terminologies
2.2.1 Term/Token
2.2.2 Document
2.2.3 Corpus
2.2.4 Bag of words
2.2.5 Term Frequency (TF)
2.2.6 Inverse Document Frequency (IDF)
2.2.7 Term Frequency–Inverse Document Frequency (TF-IDF)
2.2.8 Document-Term matrix
2.2.9 Document-Topic matrix
2.2.10 Topic-Word matrix
2.3 Topic modelling
2.3.1 Latent Semantic Analysis (LSA)
2.3.2 Probabilistic Latent Semantic Analysis (PLSA)
2.3.3 Latent Dirichlet Allocation (LDA)
2.3.4 Lda2Vec
2.4 Review of Literature
2.5 Conclusion
3 METHODOLOGY
3.1 Introduction
3.2 Proposed model
3.2.1 Data Collection
3.2.2 Data Pre-processing
3.2.3 Dictionary Generation
3.2.4 Bag-of-Words Generation
3.2.5 Topic Modelling
3.2.6 Latent Semantic Analysis (LSA)
3.2.7 Latent Dirichlet Allocation (LDA)
4 RESULT AND DISCUSSION
4.1 Introduction
4.2 Results
4.2.1 Results for 2015
4.2.2 Results for 2016
4.2.3 Results for 2017
4.2.4 Results for 2018
4.2.5 Results for 2019
4.3 Discussion
5 CONCLUSION AND RECOMMENDATION
5.1 CONCLUSION
5.2 RECOMMENDATION
References
Appendix

 

CHAPTER ONE

 

1.1 Introduction
There has been an exponential increase in the availability of data over the past years. According to Hal Varian, Chief Economist at Google, “Between the dawn of civilization and 2003, we only created five exabytes; now we’re creating that amount every two days. By 2020, that figure is predicted to sit at 53 zettabytes (53 trillion gigabytes) — an increase of 50 times.” We now generate 2.5 quintillion bytes of data every day, and 90% of the world’s data has been created in the past two years alone (Winans et al., 2017). These data are generated from the Internet, social media, IoT devices, communications, digital photos, videos and online services. With this increase in the availability of data comes the question of what we can do with it, because the data growth phenomenon continues. With smartphones and Internet access becoming more affordable and available, the number of social media users is on the rise; this again means an increase in data generation and availability. Every minute, Google conducts 3,877,140 searches, 49,380 users post on Instagram, 4,333,560 videos are streamed on YouTube and 473,400 tweets are sent on Twitter (Data Never Sleeps 6.0, 2018). Given these statistics, the question once again is: how can the available data be used? Much of this data comes in unstructured, textual form and is mined using special techniques such as information retrieval, clustering, text summarization and topic modelling. Insights in politics, business, entertainment and health can be derived from the wealth of available data by applying topic modelling techniques.
1.2 Data Mining
Data mining encompasses numerous techniques and processes. It can be defined as the process of gaining meaningful insights and patterns from a large data set. Various forms of data (text, numeric, time series, structured, unstructured, etc.) require different techniques. The data mining pipeline typically has three phases (Aggarwal, 2015).
1.2.1 Data collection
Data collection is the phase of gathering the right data to accomplish the task at hand. This often means gathering data from various sources such as surveys and questionnaires, sensors and web scraping, as the required data might not be in one place. This phase is critical because the quality of the data gathered affects the result of the entire mining process. The data must be relevant to the task; the old saying “garbage in, garbage out” also applies in data mining.
1.2.2 Feature extraction and data cleaning
Real-world data most times does not come in very good shape. There could be missing data, unrealistic data such as negative ages, poorly scaled data such as salaries ranging from N100 to N100,000,000, and so on. Hence there is a need to clean up the collected data. Also, not all features/characteristics of the data might be necessary for the mining process. Hence, after the data has been collected and cleaned, we must choose the features needed for the mining process.
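As a toy illustration of this phase (the records and thresholds below are invented for the example, not data used in this thesis), one might drop records with missing or implausible values and then min-max scale a poorly scaled salary field:

```python
# Toy cleaning step: drop invalid records, then min-max scale salaries.
records = [
    {"age": 34, "salary": 150_000},
    {"age": -5, "salary": 90_000},       # unrealistic age -> dropped
    {"age": 41, "salary": None},         # missing salary -> dropped
    {"age": 28, "salary": 100_000_000},  # large but valid value
]

# Data cleaning: keep only records with a plausible age and a salary present.
clean = [r for r in records if 0 < r["age"] < 120 and r["salary"] is not None]

# Feature scaling: map salaries into [0, 1] so no field dominates later analysis.
lo = min(r["salary"] for r in clean)
hi = max(r["salary"] for r in clean)
for r in clean:
    r["salary_scaled"] = (r["salary"] - lo) / (hi - lo)

print(len(clean))                 # 2 records survive cleaning
print(clean[0]["salary_scaled"])  # 0.0 (the minimum salary)
```

Real pipelines would typically use a library such as pandas for this, but the logic is the same: validate, filter, then rescale.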
1.2.3 Analytical processing and algorithms
At this phase the data is ready for mining; depending on the task at hand, the appropriate process(es) and algorithm(s) are chosen and applied to the data. Figure 1-1 below gives an overview of the data mining process.
Figure 1-1 Data Mining Process
1.3 Text Mining
Text mining, also known as text analytics, is a specialization of data mining. In text mining, we look for insights from large text data sets, which often come unstructured. Different text mining tasks require different mining techniques. Text mining techniques include text categorization, text clustering, entity extraction, sentiment analysis, document summarization, topic modelling, etc.
The text mining cycle is like the data mining cycle but includes a ‘data structuring’ phase after data collection, as the techniques cannot be applied directly to unstructured text data.
1.4 Topic Modelling
Given a text data set, usually a collection of documents, one common task is to derive the topics in that data. Topic modelling is the process of applying statistical models (topic models) to extract the hidden (latent) topics in the data. These models work by discovering hidden patterns in the document collection. Existing topic models include:
i. Latent Semantic Analysis (LSA)
ii. Probabilistic Latent Semantic Analysis (PLSA)
iii. Latent Dirichlet Allocation (LDA)
iv. Correlated Topic Model (CTM)
v. Explicit semantic analysis
vi. Hierarchical Dirichlet process
vii. Non-negative matrix factorization
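All of the models above start from the same input: a document-term (bag-of-words) matrix counting how often each term appears in each document. The following standard-library sketch (with a tiny invented corpus) shows how that input is built before any topic model runs:

```python
from collections import Counter

# Tiny corpus, invented for illustration.
docs = [
    "lassa fever outbreak in lagos",
    "ebola outbreak response team",
    "child immunization campaign in lagos",
]

# Tokenize each document and build the vocabulary across the whole corpus.
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Document-term matrix: one row per document, one count column per term.
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

col = vocab.index("outbreak")            # column for the term "outbreak"
print([row[col] for row in dtm])         # [1, 1, 0]
```

A topic model then factors this matrix into document-topic and topic-word distributions; in practice libraries such as gensim or scikit-learn handle both steps.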
1.5 Applications of Topic Modelling
• In information retrieval (IR), topic models are used for smoothing language models, query expansion and search personalization (Boyd-Graber, Hu & Mimno, 2017).
• Topic models are used to track topical changes in various fields influenced by historical events, by examining newspapers, historical records and historical scholarly journals (Boyd-Graber, Hu & Mimno, 2017).
• In the literary world, topic models are used to analyse the creative, diverse oeuvre of authors and the emotions and thoughts of fictional characters (Boyd-Graber, Hu & Mimno, 2017).
• With the wealth of online discussion across social media platforms, topic models can help companies understand their customers, politicians target voters, and researchers study the impact of social media on people’s everyday lives, by unlocking the emotions and hidden factions often present in online discussions (Boyd-Graber, Hu & Mimno, 2017).
1.6 Problem Statement
The boom of the Internet and social media (Twitter in particular) is still young in Nigeria. With Nigeria being the second most tweeting country in Africa (Portland Communications, 2016) and having 6 million active users (Terragon Group, 2018), lots of data are generated daily via this medium, from which vital information can be retrieved to improve decision making. We seek to answer the question: what meaningful insights can be gained from this data through topic modelling? The outcome of this research should spark local utilization of this ever-growing data through topic modelling, as well as further research in this area.
1.7 Aim and Objectives
The aim of this work is to find the top health topics in Nigeria over the past few years by applying topic modelling to Nigerian health Twitter data, thereby demonstrating the potential of Twitter mining.
As Nigeria is a developing country that still battles numerous health issues, we seek to find the top health topics in Nigeria over the past few years by achieving the following objectives:
• Collect the tweets from major databases.
• Use LDA to find the topics for one year.
• Use LSA to find the topics for the same year.
• Compare the results of LDA and LSA and choose the better model.
• Perform Twitter mining with the better topic model on data from 2015–2019.
1.8 Methodology
We follow these steps to achieve our objectives:
1. The needed data is not domiciled in one location and must therefore be collected. Top official health Twitter accounts are identified and their tweets collected through the GetOldTweets3 Python library. Since tweets are very short (at most 280 characters), we aggregate all tweets from an account into a single document.
2. The collected data is noisy and unstructured, so we next clean it and put it into a structure fit for topic modelling. This step is carried out repeatedly.
3. LDA is then performed on the data to extract the latent topics.
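The cleaning in step 2 can be sketched with the standard library alone. The tweet, the regular expressions and the stop-word list below are illustrative assumptions; the pre-processing actually used in this thesis is detailed in Chapter 3:

```python
import re

STOPWORDS = {"the", "in", "of", "for", "a", "is", "rt"}  # illustrative subset

def clean_tweet(text):
    """Strip URLs, mentions, hashtags and punctuation; lowercase; drop stop words."""
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"[@#]\w+", " ", text)        # remove mentions and hashtags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # keep letters only
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

tweet = "RT @NCDCgov: Confirmed cases of Lassa fever in Edo! https://t.co/xyz #health"
print(clean_tweet(tweet))
# ['confirmed', 'cases', 'lassa', 'fever', 'edo']
```

The resulting token lists, one per aggregated account document, are what the dictionary and bag-of-words generation steps of Chapter 3 consume.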
The Python language is used for this thesis on Google Colaboratory, which provides GPUs in the cloud for faster computation.
In the next chapter, we review the literature on what has been done in the field of Twitter mining. The methodology used is detailed in Chapter 3 and the results are reported in Chapter 4. Chapter 5 states the conclusion and suggests further work.
