Due to the massive increase in medical documents every day (including
books, journals, blogs, articles, doctors’ instructions and prescriptions,
emails from patients, etc.), it is becoming very challenging to handle and
to categorize them manually. One of the most challenging projects in
information systems is extracting information from unstructured texts,
including medical document classification. The discovery of knowledge
from medical datasets is important for making effective medical
diagnoses. Developing a classification algorithm that classifies a medical
document by analyzing its content and categorizing it under predefined
topics is the primary aim of this research. In this project work we
successfully applied Natural Language Processing, which draws on
Machine Learning, to the classification of health-related documents.
We made use of the OpenNLP Application Programming Interface
(API), a Java API, for training a model and classifying the documents.
A front-end framework was used for building the user interface. The
software is also built using the Model-View-Controller (MVC)
architecture. The algorithm classified the articles under their actual
subject headings and got all of the subject headings correct. This holds
promise for the global health arena as a way to index and classify
medical documents.
This chapter introduces the topic of the project work A System for
Health Document Classification Using Machine Learning. In this
chapter, we will consider the background of the study, statement of the
problem, aims and objectives, methodology used to design the system,
scope of the study, its significance, definition of terms, and we conclude
with the project layout or organization of the project work.
1.1 BACKGROUND OF THE STUDY
Today, most hospitals, medical laboratories and other health
facilities make use of some kind of information system, such as a
hospital management system or a pharmacy management system. Among
the other functions that these systems provide, they are mainly used
for collecting patient records, which they store in digital format.
A large volume of patient data is recorded on a daily basis, forming
a large data set popularly referred to as “Big Data”.
Every day, physicians and other health workers are required to work with
this “Big Data” in order to provide solutions. Some of their everyday
tasks include information retrieval and data mining. Retrieving
information from big data can be very laborious and time-consuming. This
has given rise to the study of text or document classification in order
to aid the process of retrieving information from big data. Today, text
classification is a necessity due to the very large number of text
documents that we have to deal with daily.
Document classification is the task of grouping documents into
categories based upon their content. Document classification is a
significant learning problem that is at the core of many information
management and retrieval tasks. Document classification plays an
essential role in various applications that deal with organizing,
classifying, searching and concisely representing a significant amount of
information. Document classification is a longstanding problem in
information retrieval which has been well studied (Russell, 2018).
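To make the idea of grouping documents "based upon their content" more concrete, the sketch below shows one common way of turning a document's text into countable features, often called a bag of words. It is only an illustration and is not taken from the project's code: the class name BagOfWords, the method termCounts, and the sample sentence are all assumptions made for this example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a hypothetical helper, not part of the project's software.
public class BagOfWords {

    // Lowercase the text, split it on anything that is not a letter or digit,
    // and count how often each remaining token occurs.
    public static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical one-sentence "document", used only for demonstration.
        String sample = "Fever and cough; persistent cough for three days.";
        System.out.println(termCounts(sample));
    }
}
```

A classifier can then compare such term counts across documents. Word order is discarded in this representation, which is a deliberate simplification.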
Usually, machine learning, statistical pattern recognition, or neural
network approaches are used to construct classifiers automatically.
Machine learning approaches to classification suggest the automatic
construction of classifiers using induction over pre-classified sample
documents. In this project work we will employ machine learning in
classifying health documents.
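The phrase "induction over pre-classified sample documents" can be illustrated with a small sketch. The example below is a minimal multinomial Naive Bayes classifier written in plain Java. It is a swapped-in technique chosen for brevity and is not the OpenNLP-based implementation used in this project. The class name NaiveBayesSketch, the category labels and the training sentences are hypothetical and serve only as an illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: a tiny Naive Bayes text classifier, not the project's OpenNLP model.
public class NaiveBayesSketch {
    private final Map<String, Integer> docsPerCategory = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerCategory = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    // Learn from one pre-classified sample document.
    public void train(String category, String text) {
        totalDocs++;
        docsPerCategory.merge(category, 1, Integer::sum);
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            wordsPerCategory.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Return the category with the highest log-probability for the given text,
    // or null if nothing has been trained yet.
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docsPerCategory.keySet()) {
            // Prior: fraction of training documents that belong to this category.
            double score = Math.log(docsPerCategory.get(category) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = wordsPerCategory.getOrDefault(category, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                int count = counts.getOrDefault(token, 0);
                // Add-one (Laplace) smoothing so unseen words do not zero out the score.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        // Hypothetical pre-classified sample documents.
        nb.train("cardiology", "chest pain and irregular heartbeat");
        nb.train("cardiology", "high blood pressure and heart failure");
        nb.train("dermatology", "itchy skin rash and eczema");
        nb.train("dermatology", "acne and dry skin treatment");
        System.out.println(nb.classify("patient reports heart pain"));
    }
}
```

The sketch is trained on only four sentences, so it shows the mechanics of learning from labelled examples rather than realistic accuracy. A production system would need a much larger labelled corpus and a proper evaluation.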
1.2 STATEMENT OF THE PROBLEM
With the explosion of information fuelled by the growth of the World
Wide Web, it is no longer feasible for a human observer to understand all
the data coming in or even classify it into categories. In the health
sector too, numerous patient records are collected every day and
used for analysis. How do we efficiently classify or categorize these
health documents to support easy retrieval?
1.3 AIM AND OBJECTIVES OF THE STUDY
The aim of this project is to develop A System for Health Document
Classification Using Machine Learning.
Other objectives include:
1. Study the various machine learning classification algorithms.
2. Implement a classification algorithm in Java.
1.4 SCOPE OF THE STUDY
As stated earlier, machine learning, statistical pattern recognition and
neural network approaches are all used in classifying documents. This
project work concentrates on using a machine learning algorithm to
classify documents.
1.5 SIGNIFICANCE OF THE STUDY
The software delivered from this project work will greatly reduce the
time used by doctors, physicians and other health workers in searching
and retrieving documents.
Other benefits of this project work include:
1. Helps students and other interested individuals that want to develop a
2. It will serve as a source of material for those interested in investigating
the processes involved in developing a document classification system
using machine learning.
3. It will serve as a source of material for students who are interested in
studying machine learning.
1.6 DEFINITION OF TERMS
Document Classification: is the task of grouping documents into
categories based upon their content.
Health Document: A health certificate is written by a doctor and
displays the official results of a physical examination.
Machine Learning: the study and construction of algorithms that can
learn from and make predictions on data.
JSP: JavaServer Pages is a Java technology for creating dynamic web
HTML: Hyper Text Markup Language for creating web-pages.
MYSQL: A database management system for creating, storing and
SERVLET: is a small pluggable extension to a Server that enhances the
BOOTSTRAP: is a sleek, intuitive, and powerful mobile first front-end
framework for faster and easier web development. It uses HTML, CSS
1.7 ORGANIZATION OF WORK
Chapter one introduces the background of the project and sets out the
statement of the problem, the objectives of the project, its significance,
scope, and constraints.
Chapter two reviews the literature on machine learning and document
classification, together with related work.
Chapter three discusses system investigation and analysis. It deals with
the detailed investigation and analysis of the existing system and with
problem identification. It also presents the proposal for the new system.
Chapter four covers the system design and implementation.
Chapter five presents the summary and conclusion of the project.