
Project File Details



File Type: MS Word (DOC) & PDF

File Size: 1,019 KB

Number of Pages: 57

 

ABSTRACT

The telecommunication industry holds a large amount of data about households, individuals and devices. Advertisers pay a premium to ensure they advertise to their target audience. To ensure that content is personalized, it is necessary to accurately predict who is using a device in real time. A probabilistic matching algorithm that determines the profile of an individual based on behavioural analytics is developed and implemented. Two datasets, ‘People data’ and ‘Device data’, were linked and matched using the social behaviours exhibited by the individuals whose information is contained in the People data and by the devices whose addresses show the specific social behaviours of the individuals who use them. A match score was generated to show how likely it is that a pair of records from the two datasets refers to the same individual (i.e. to show whether both records are indeed a match or not).
Key words: Probabilistic Matching

 

 

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
CHAPTER 1
1.1. INTRODUCTION
1.2. BACKGROUND STUDY
1.2.1. On Record Linkage
1.2.2. On Identity Management
1.3. LIMITATIONS OF SOME WORKS
1.4. PROBLEM STATEMENT
1.5. AIM OF STUDY
1.6. OBJECTIVES
1.7. TECHNOLOGIES REQUIRED
1.8. DEFINITION OF TERMS
1.9. GENERAL PROBLEMS OF PROBABILISTIC MATCHING
1.10. PROJECT SCOPE
CHAPTER 2
2.1. RECORD LINKAGE
2.1.1. Deterministic Record Linkage Versus Probabilistic Record Linkage
2.2. IDENTITY MANAGEMENT
2.2.1. Authorization Versus Authentication
2.3. RELATED WORKS
CHAPTER 3
3.1. METHODOLOGY
3.1.1. General Problems of Record Linkage That Require Probabilistic Matching
3.2. PROBABILISTIC MATCHING
3.2.1. Probabilistic Matching
3.2.2. Performing Probabilistic Matching
3.2.3. Probabilistic Matching Algorithm
3.2.4. Mathematical Implications of Probabilistic Matching
3.2.5. Fellegi-Sunter Model
3.2.6. Processes in Probabilistic Matching
CHAPTER 4
4.1. MATCH SCORE AND POSITIVE PREDICTIVE VALUE (PPV)
4.1.1. Manual Calculation of Match Score and PPV
4.1.2. Using Fuzzywuzzy Library for PPV and Match Score
4.1.3. Comparing Manual Matching, Fuzzywuzzy and EM Algorithms
4.2. STRING COMPARATORS
4.3. THE MATCHED DATASETS
CHAPTER 5
5.1. CONCLUSION
5.2. CHALLENGES
5.3. RECOMMENDATION
5.4. CONTRIBUTIONS
5.5. FUTURE WORKS
5.6. APPLICATIONS OF PROBABILISTIC MATCHING
APPENDIX
REFERENCES

 

 

CHAPTER ONE

1.1. INTRODUCTION
The Telecommunications industry is one of the subsectors that make up the Information and Telecommunication Technology sector. This industry includes all telephone companies, Internet Service Providers (ISPs), radio companies and television companies. The industry keeps growing wider and more complex as the devices involved proliferate.
The Telecommunication industry generates very high revenue. Research has it that, due to the increasing scope of the industry, telecommunications service revenue will grow from $2.2 trillion in 2015 to $2.4 trillion in 2019. One way to achieve this growth is through advertisement. Advertisers pay huge amounts of money to advertise their services, so there is a need not only to advertise products and services but, more importantly, to advertise them to the target user. When products and services are advertised to the target audience, there is a higher chance of companies selling and users purchasing. Therefore, there is a need to know who is using a device at a particular time and what that user is interested in. The rapid increase in the number of devices available allows individuals (or households, as the case may be) to own more than one device. This need to identify users at a given time, and to advertise to the target audience, is where Identity Management comes into consideration.
Since Identity Management deals with individuals, different attributes of individuals are used to implement it. Attributes of individuals are classified into personal attributes, social behavior attributes and social relationship attributes (Li & Wang, 2015). To carry out efficient Identity Management, the attributes and the individuals themselves must be accurately matched; hence the use of a matching algorithm to find the best match.
1.2. BACKGROUND STUDY
1.2.1. On Record Linkage
The term record linkage was introduced by Halbert L. Dunn in his paper “Record Linkage”, published in 1946, where it referred to the linking of medical records associated with individuals. Dunn described a system developed by the Dominion Bureau of Statistics in Canada in which information containing the names of individuals was transferred from microfiche to punch cards, after which lists were printed for verification and review by different agencies in Canada. These methods were cost‐effective at the time because they were far more efficient than purely manual matching and maintenance of paper files.
Computerised record linkage generally began with methods introduced by the geneticist Howard Newcombe (in his papers “Automatic Linkage of Vital Records”, published in 1959, and “Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information”, published in 1962), who used odds ratios and value‐specific frequencies (for example, the common last name ‘Smith’ has less distinguishing power than the rare last name ‘Zabrinsky’). Fellegi and Sunter (in their 1969 paper, “A Theory for Record Linkage”) then gave a mathematical formalisation of Newcombe’s ideas. They proved the optimality of Newcombe’s classification rule and introduced many ideas about estimating ‘optimal’ parameters (the probabilities used in the likelihood ratios) without training data. Training data, which makes suitable parameter estimation much easier, is a set of record pairs for which the true matching status is known. It can be created, for example, through iterative review methods in which the ‘true’ matching status is obtained for large subsets of pairs (Winkler, 2015).
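The idea of value-specific frequencies can be illustrated with a minimal Python sketch. It assumes, as a simplification, that two unrelated records agree on a surname about as often as that surname occurs in the file, so that an agreement is worth roughly log2(1 / relative frequency) bits. The function name and the surname counts are invented for illustration and are not taken from Newcombe's papers or from the project data.

```python
import math
from collections import Counter


def value_specific_weights(surnames):
    """Return an approximate agreement weight (in bits) per distinct surname.

    Simplifying assumption: the chance that two unrelated records share a
    value equals the value's relative frequency in the file.
    """
    counts = Counter(surnames)
    total = len(surnames)
    return {name: math.log2(total / n) for name, n in counts.items()}


# Invented toy file: 'Smith' is common, 'Zabrinsky' is rare.
surnames = ["Smith"] * 50 + ["Jones"] * 30 + ["Zabrinsky"] * 2
weights = value_specific_weights(surnames)

# List surnames from least to most distinguishing.
for name, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{name:10s} {weight:5.2f} bits")
```

In this toy file, agreement on ‘Zabrinsky’ receives a larger weight than agreement on ‘Smith’, which is the behaviour the paragraph above describes.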
1.2.2. On Identity Management
Due to the ubiquity and rapid development of technology and web applications, access to different applications is made quite easy for illegal users. Developers are therefore driven to create more secure environments for applications by allowing for more careful control.
Identity Management dates as far back as the 19th century: in 1853, the government of the United Kingdom made it compulsory for citizens to register new births, and by 1902 birth registration had been standardized across the entire United States. In the 20th century, the United States saw the first driver’s license, the first passport, the first Social Security Number, the first digital identities and passwords, and the birth of the commercial internet. Passwords were introduced to keep information about individuals and bodies private. In those times, Identity Management generally consisted of manual sheets and other services used to track accounts. As soon as the commercial internet was born, traditional Identity Management systems were adapted for online applications.
In the year 2000, the population of internet users grew to about 400 million people, and this increased vices such as identity theft, performed by and on people through the internet. To curb these vices, an effective and efficient system was developed, and the Identity Management Stack was born. This stack had a limitation, though: it was very expensive to maintain.
In 2010, the Identity as a Service cloud was created with the aim of simplifying, automating and reducing the costs associated with the Stack. From 2010 to date, Identity Management has been fully digitized and is in successful use in today’s computing.
1.3. LIMITATIONS OF SOME WORKS
Although extensive research has been carried out on Identity Management, most works have focused solely on the personal attributes of individuals and paid little or no attention to their behavioural attributes (Li & Wang, 2015). Some works have also dwelt only on the supervised and semi-supervised techniques of machine learning for probabilistic linkage (Diaz-Morales, 2015).
It has also been observed that probabilistic matching has not been applied to the telecommunication industry to a large extent. This work attempts to explore and implement probabilistic matching algorithms on data from telecommunication industries.
1.4. PROBLEM STATEMENT
New products and services are thought of, designed, implemented and released frequently, and such products and services can best reach individuals through advertisement. The fastest forms of advertisement are those done through telecommunication devices, because the audience can see and hear what is being advertised from very far away. It is one thing to advertise to the public; it is another to advertise to the target individual. If the target individual is not reached, sales will be low, which will result in low profit or even loss for the company doing the advertisement and for the telecommunication industry at large. Also, for Identity Management, both the personal attributes and the behavioural attributes of an individual should be taken into consideration. In addition, all techniques of machine learning should be considered, and the technique that produces the best match should be noted.
1.5. AIM OF STUDY
The aim of this study is to develop a probabilistic matching algorithm that links individuals to devices, thereby revealing the profile of an individual (which includes the individual’s preferences and interests) so that the most suitable products can be advertised to that individual.
1.6. OBJECTIVES
I. Collate data, perform data quality assessment and data cleansing.
II. Group attributes of individuals into Personal attributes and Social (behavioural) attributes.
III. Use Machine Learning Techniques to develop a probabilistic matching algorithm to determine the profile of an individual (Identity Management) based on behavioural analytics.
1.7. TECHNOLOGIES REQUIRED
a) Machine Learning:
Machine learning is usually seen as a subset of Artificial Intelligence and is defined as the scientific study of algorithms, statistical models and other techniques that computer systems can use to perform their tasks with little or no human intervention, relying on patterns and inference instead. Algorithms developed in Machine Learning usually build a model from sample data, generally referred to as the training data. Analysis from machine learning can be predictive, exploratory, descriptive or prescriptive, each largely dependent on the Machine Learning technique used.
b) Supervised Learning and Unsupervised Learning:
Machine learning is generally divided into Supervised Learning and Unsupervised Learning. In supervised learning, the output required from the system is known in advance, and the model is adjusted until it produces that output closely enough. In unsupervised learning, no specified output is given to the system. The system receives input data and produces output data, and inferences and deductions are drawn from that output.
c) Data Analysis:
Data Analysis, sometimes referred to as Data Modelling, can be approached in many ways. It is the process of inspecting, cleansing, transforming and modelling data with the aim of discovering useful information and supporting decision making. Descriptive Statistics, Exploratory Data Analysis (EDA) and Confirmatory Data Analysis (CDA) are the three main classifications of Data Analysis.
d) Probabilistic Matching:
It is also known as fuzzy matching or probabilistic record linkage. It is the task of finding records in one or more datasets that refer to the same entity (in this case, the same individual), even if the data come from different sources. It takes into account a wide range of potential identifiers, computes a weight for each identifier based on its estimated ability to identify a match, and then uses the computed weights to decide whether a record pair refers to the same entity.
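The weighting step described above can be sketched in Python along the lines of the Fellegi-Sunter model. This is a minimal illustration, not the project's implementation: the field names, the example records and the m- and u-probabilities (the chance that a field agrees for a true match and for a non-match respectively) are all hypothetical values chosen for the example.

```python
import math

# Hypothetical (m, u) probabilities per behavioural field:
#   m = P(field agrees | records belong to the same individual)
#   u = P(field agrees | records belong to different individuals)
FIELD_PARAMS = {
    "home_area": (0.95, 0.01),
    "active_hours": (0.85, 0.20),
    "content_genre": (0.90, 0.10),
}


def field_weight(m, u, agrees):
    """Log2 likelihood ratio contributed by one field comparison."""
    if agrees:
        return math.log2(m / u)          # positive when m > u
    return math.log2((1 - m) / (1 - u))  # negative when m > u


def match_score(record_a, record_b, params=FIELD_PARAMS):
    """Sum the field weights for a pair of records (exact comparison only)."""
    return sum(
        field_weight(m, u, record_a[field] == record_b[field])
        for field, (m, u) in params.items()
    )


# Invented example records from the two datasets.
person = {"home_area": "ikeja", "active_hours": "evening", "content_genre": "sports"}
device = {"home_area": "ikeja", "active_hours": "evening", "content_genre": "news"}

print(round(match_score(person, device), 2))
```

A field with a large gap between m and u, such as the hypothetical home_area, contributes more to the score than a field on which unrelated records often agree by chance.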
1.8. DEFINITION OF TERMS
Advertisement:
A marketing communication that makes use of an openly sponsored non-personal message to create, develop, promote and sell a product or a service. Advertisements can be done through various communication media outlets such as newspapers, magazines, radio, television, blogs, social media etc. Advertisement can be classified by style, target audience, geographic scope and purpose.
Identity Management:
It is basically the authorization and authentication of a user in order to grant the user access to the resources made available. It also determines, to an extent, what the user can do with such resources. Identity Management covers some key areas that must be handled to ensure accuracy: Directory Services, Identity Administration and Access Management. Directory Services allow files and resources to be located and also allow user data to be accessed. Identity Administration monitors the lifecycle of the data, i.e. how often the data changes, and also effects the changes. Access Management, a term sometimes used interchangeably with Identity Management, authorizes and authenticates (verifies and validates) the users of a particular resource by accessing their data and deeming them fit or unfit to access the resource (Oracle Corporation, 2008).
Probabilistic Matching Algorithms:
These are algorithms developed and implemented to apply record linkage (probabilistic matching) to data so that the data needs less human intervention.
1.9. GENERAL PROBLEMS OF PROBABILISTIC MATCHING
i. There is no personal or key identifier on one or both datasets to be matched.
ii. There could be the problem of missing data in the dataset(s).
iii. Only some information, such as gender, address and state, may be available for matching.
iv. Some Identifiers have more weight or discriminatory powers than other identifiers.
v. The maximum and minimum thresholds that determine the matching status of a record pair are not easily known.
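Two of the problems listed above, missing data and the choice of thresholds, can be illustrated with a short Python sketch. It assumes that per-field weights have already been computed for each record pair, that a missing comparison is recorded as None and contributes no evidence, and that the lower and upper thresholds are chosen by the analyst. All names, weights and threshold values here are hypothetical.

```python
def total_weight(field_weights):
    """Sum per-field weights, skipping fields that could not be compared."""
    return sum(w for w in field_weights.values() if w is not None)


def classify(score, lower, upper):
    """Three-way decision using a lower and an upper threshold."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"  # left for manual review


# Hypothetical thresholds; in practice they must be tuned for the data.
LOWER, UPPER = 0.0, 8.0

# Invented per-field weights for three record pairs (None = missing value).
pairs = {
    "pair_1": {"home_area": 6.6, "active_hours": 2.1, "content_genre": 3.0},
    "pair_2": {"home_area": 6.6, "active_hours": None, "content_genre": -2.5},
    "pair_3": {"home_area": -4.3, "active_hours": -1.9, "content_genre": None},
}

decisions = {
    name: classify(total_weight(weights), LOWER, UPPER)
    for name, weights in pairs.items()
}
print(decisions)
```

Treating a missing value as zero evidence is only one possible design choice. Moving the two thresholds closer together reduces the number of pairs sent for manual review at the cost of more wrong automatic decisions.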
1.10. PROJECT SCOPE
Chapter 1: This chapter covers the introduction, the background study, the problem statement, and the aim and objectives of the work. It also covers the definition of terms, the technologies required and the general problems of probabilistic matching.
Chapter 2: This chapter contains the literature review of works related to this project and, through the review, shows how this project is distinct and how past works can help in achieving its aims.
Chapter 3: In this chapter, the materials and methods used for the project are discussed and implemented. The Python programming language was used to write the code that implements this project, and different Python libraries were used to carry out specific processes.
Chapter 4: The testing and validation of the project are analyzed in this chapter. The testing was divided into three parts: the data matching itself; the comparison of string comparators on the sets of matched and unmatched data; and the comparison and confirmation of the match scores and Positive Predictive Values obtained from Fuzzywuzzy record matching, manual review and the EM (Expectation Maximization) algorithm.
Chapter 5: This chapter comprises the conclusions, contributions, challenges, recommendations and future works.

 
