Data mining and its application to vector-borne diseases

Data Mining

Data mining (also called knowledge discovery) is the extraction of hidden predictive information from large databases. It is a powerful technology with great potential to help organizations and companies focus on the most important information in their data warehouses. Data mining is a young, interdisciplinary field that draws on database systems, data warehousing, machine learning, statistics, signal analysis, data visualization, information retrieval, and high-performance computing. It has been applied successfully in diverse areas such as marketing, finance, engineering, security, games, and science. Rather than comprising a clear-cut set of methods, the term “data mining” refers to an eclectic approach to data analysis in which choices are guided by pragmatic considerations concerning the problem at hand. Broadly speaking, the goals of data mining fall into two categories: description and prediction. Descriptive data mining attempts to discover implicit and previously unknown knowledge that humans can use in making decisions. Predictive data mining seeks a model or function that predicts some crucial but as yet unknown property of a given object from a set of currently known properties. Predictive data mining tasks are typically supervised machine learning problems such as regression and classification. Well-known supervised learning algorithms include decision tree learners, rule-based classifiers, Bayesian classifiers, linear and logistic regression analysis, artificial neural networks, and support vector machines.

The core components of data mining technology have been under development for decades in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, makes these technologies practical for current data warehouse environments. Data mining software analyzes the relationships and patterns in stored transaction data based on open-ended user queries. Generally, there are four types of relationships:

  • Classes: Stored data is used to locate data in predetermined groups.
  • Clusters: Data items are grouped according to logical relationships or consumer preferences.
  • Associations: Data can be mined to identify associations.
  • Sequential patterns: Data is mined to anticipate behavior patterns, predictions and trends.
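As an illustration of the "associations" relationship above, the following minimal sketch computes the two measures most often used to describe an association rule: support (how often the items occur together) and confidence (how often the consequent occurs when the antecedent does). The function names and the records are made up for illustration and do not come from any particular data mining package.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from co-occurrence counts."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical records: each set lists attributes observed together.
records = [
    {"stagnant_water", "high_vector_density"},
    {"stagnant_water", "high_vector_density"},
    {"stagnant_water"},
    {"high_vector_density"},
]

rule_support = support(records, {"stagnant_water", "high_vector_density"})
rule_confidence = confidence(records, {"stagnant_water"}, {"high_vector_density"})
```

A rule is usually considered interesting only when both values exceed thresholds chosen by the analyst.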

Different levels of analysis are available in data mining:

  • Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
  • Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
  • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome.
  • Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
  • Rule induction: The extraction of useful if-then rules from data based on statistical significance.
  • Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
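The nearest neighbor method in the list above can be sketched in a few lines. This is a minimal illustration, assuming numeric features, Euclidean distance, and a simple majority vote; the function name `knn_classify` and the historical records are hypothetical.

```python
import math
from collections import Counter


def knn_classify(query, history, k=3):
    """Label `query` by majority vote among its k nearest labeled records."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort the historical records by Euclidean distance to the query point.
    nearest = sorted(history, key=lambda rec: math.dist(query, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset: (feature vector, known class).
history = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((4.0, 4.2), "high"),
    ((4.1, 3.9), "high"),
    ((3.8, 4.0), "high"),
]

label_near_low = knn_classify((1.1, 1.0), history, k=3)
label_near_high = knn_classify((4.0, 4.0), history, k=3)
```

In practice the features are usually rescaled first, since a feature with a large numeric range would otherwise dominate the distance.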

Ref:
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
http://www.thearling.com/text/dmwhite/dmwhite.htm

Applications of Data Mining in Vector-Borne Diseases

Data mining can be applied in biomedicine for a large variety of purposes, and is thus connected to diverse biomedical fields. Traditionally, data mining and machine learning work in biomedicine focused on clinical applications, such as decision support for medical practitioners and interpretation of signal and image data. More recently, applications in epidemiology, medical entomology, bioinformatics, and biosurveillance have received increasing attention. For the past few years, the Biology Division of IICT has been working on the control of vectors and vector-borne diseases by applying various data mining tools (such as Self-Organizing Maps, CART, kNN, and Bayesian networks) to malaria, filariasis, dengue, chikungunya, and Japanese encephalitis. The data mining tool Self-Organizing Map (SOM) was customized to prioritize endemic zones of filariasis and malaria, which enables efficient targeting of risk areas for control operations. Data mining tools such as CART (Classification and Regression Trees) can also be applied to derive association rules for the effective control of vector-borne diseases such as filariasis.
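To give a feel for how a Self-Organizing Map groups similar records, here is a minimal sketch of a one-dimensional SOM with three nodes. It is not the customized IICT tool: the two-feature records, the initial weights, and the linear decay schedule are all assumptions chosen for illustration.

```python
import math


def best_matching_unit(x, weights):
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(x, weights[j]))


def train_som(data, weights, epochs=50, lr0=0.5, sigma0=1.0):
    """Train a 1-D SOM and return the new weights (the input is not modified)."""
    weights = [list(w) for w in weights]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs           # shrinks linearly toward 0
        lr = lr0 * decay                       # learning rate
        sigma = max(sigma0 * decay, 0.01)      # neighborhood width
        for x in data:
            bmu = best_matching_unit(x, weights)
            for j, w in enumerate(weights):
                # Nodes close to the winner on the map are pulled more strongly.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                weights[j] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]
    return weights


# Hypothetical records with two indicators scaled to the range 0..1.
zone_records = [
    (0.10, 0.20), (0.90, 0.80), (0.20, 0.10),
    (0.80, 0.90), (0.15, 0.15), (0.85, 0.85),
]
initial_weights = [(0.3, 0.3), (0.5, 0.5), (0.7, 0.7)]

trained = train_som(zone_records, initial_weights)
low_node = best_matching_unit((0.1, 0.1), trained)
high_node = best_matching_unit((0.9, 0.9), trained)
```

After training, records with similar indicator values map to the same or neighboring nodes, so each node can be read as a group of similar zones. How such groups are ranked for control operations depends on the indicators chosen and is outside this sketch.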

A Bayesian model (JEBNET) has been developed by IICT for predicting the vector density, measured as per man hour density (PMHD), of Japanese encephalitis (JE) mosquitoes. This software is capable of predicting the PMHD of mosquitoes one year in advance. The algorithm calculates the mosquito population that is most likely to occur in a given area.
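The idea of picking the most likely outcome can be sketched with a single conditional probability table estimated from counts. This is only an illustration of maximum-likelihood estimation for one node of a Bayesian network: the text does not describe JEBNET's variables or structure, so the rainfall condition, the density classes, and the observations below are hypothetical.

```python
from collections import Counter, defaultdict


def fit_conditional_table(observations):
    """Maximum-likelihood estimate of P(density | condition) from counts."""
    counts = defaultdict(Counter)
    for condition, density in observations:
        counts[condition][density] += 1
    table = {}
    for condition, density_counts in counts.items():
        total = sum(density_counts.values())
        table[condition] = {d: n / total for d, n in density_counts.items()}
    return table


def most_likely(table, condition):
    """Density class with the highest estimated probability for a condition."""
    probs = table[condition]
    return max(probs, key=probs.get)


# Hypothetical past observations: (condition, observed density class).
observations = [
    ("heavy_rain", "high"), ("heavy_rain", "high"), ("heavy_rain", "low"),
    ("light_rain", "low"), ("light_rain", "low"), ("light_rain", "high"),
]

table = fit_conditional_table(observations)
predicted_heavy = most_likely(table, "heavy_rain")
predicted_light = most_likely(table, "light_rain")
```

A full Bayesian network combines several such tables, one per variable, and propagates evidence through them.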

Based on the kNN approach, IICT developed a novel tool, VB Classif ver. 1.0, for the classification of epidemiological data on vector-borne diseases. VB Classif classifies records according to the presence or absence of filarial cases, which helps in devising a clear strategy for mass drug administration programmes. It also helps in proper targeting of patients and efficient use of resources. Interactive classification tools supported by artificial intelligence (AI), such as VB Classif 1.0, can pave the way for more efficient disease control and help epidemiologists find quick solutions to classification problems, such as classifying the records of those affected by filariasis. The software has been used to classify up to 100,000 records, with an effective classification yield of 94%.

Website Designed & Maintained By Bioinformatics Group, Indian Institute Of Chemical Technology. Hosted at