Michael Madden IT Department NUI Galway
Home
Bayesian Network Classifiers in Lisp

Bayesian Network Classifiers in Lisp

Dr Michael Madden
Senior Lecturer in Information Technology
College of Engineering & Informatics
National University of Ireland, Galway.

This email address is being protected from spam bots, you need Javascript enabled to view it

Instructions: see below.

Right-click on a file to download it.

madden_bn_v36.lsp LISP implementation of the various Bayesian Network classifiers.
weather_defines.lsp
weather_data.lsp
Small sample input files
name2bn
data2bn
mlc2bn
Perl programs to convert files from MLC++ format to this software format
lcurve Perl program for running a batch of analyses
runperl.bat For Windows users, a handy batch file for running Perl programs that don't have any file extension (as in Unix): copy this file, give it the same name as the Perl program (e.g. lcurve.bat to go with lcurve above), and put it in the same folder as the Perl program.

Instructions

This describes the basics of how to use my implementations of General Bayesian Network classifiers and other Bayesian classifiers. For further details, please read the code.

For a description of the classifiers, and a discussion of their relative performance, see: “On the Classification Performance of TAN and General Bayesian Networks” , Michael G. Madden. Knowledge Based Systems, 2009. http://datamining.it.nuigalway.ie/images/pubs/kbs-jan2009-mmadden.pdf.

Please cite that paper if you use this software in your work.

If you have problems using this code or if you think you have found a bug in it, you are welcome to contact me.

Requirements:

  1. A Common Lisp implementation -- I use CLISP (Bruno Haible) on Unix and Windows
  2. For supporting scripts, Perl Version 5
  3. DOT/DOTTY, for displaying network structures
  4. To produce data for learning curves on continuous-valued data, the lcurve program below uses the MLC++ discretize utility

Input File Formats:

See example files for the tiny Weather (a.k.a. PlayTennis) datset: weather_defines.lsp and weather_data.lsp. Note that, unrealistically, the testing data is just a copy of the training data in this example.

The following global variables must be defined before running the functions:

*nodes*
list of node names
*assmnts*
For each node, a list of possible values that can be assigned to it
*class_node*
Index (from 0) of the classification node in *nodes*
*train*
List of training cases, where each case is a list of value assignments for each node
*test*
List of testing cases (same format as *train*)
verbose
0: very little output. 1: normal output. 2+: debug output. 0 recommended if running functions in a batch.

Generating and Evaluating Bayesian Classifiers

The procedure for all of the classifiers is the same apart from one function call and one function parameter. The full procedure is described for the Naive Bayes classifier, and for the others only the differences are described.

In these instructions, I assume the current version of the BN Lisp software has been compiled and is loaded. In CLISP, this is done with the following commands (assuming the file containing the software is called madden_bn_v36.lsp):

(compile-file "madden_bn_v36") (load "madden_bn_v36") 

Format of Structure Representation

The functions to construct Bayesian structures -- (naive), (tanbn), (fggtan), (k2 N) -- all return a representation of the structure and also define a global variable with the return value; poor programming practice perhaps, but useful when experimenting with the different structures. The structure representation is a list of lists, where the head (car) of the inner list is a node and its tail (cdr) is a list of parent nodes. Nodes are represented numerically according to their position in *nodes*. To visualise the structure, use the DOT function described below. The name of the global variable depends on which classifier is used: see the descriptions of the classifiers below for details.

Naive Bayes Classifier

Inside your Lisp environment:
(load "weather_defines")       ;; NOTE: the names of these files may vary; thieir purpose is to  (load "weather_data")          ;;       define *nodes*, *assmnts*, *class_node*, *train*, *test*  (setq verbose 1)               ;; Verbose mode: set to 0 if running in a batch  (naive)                        ;; Construct Naive structure -- result is *naive* (probs *naive*)                ;; Generate probabilities for structure in *naive* (test *naive* *probs* *test*)  ;; Test the network given by structure in *naive* and probablities in *prob* on the test dataset *test* (stats *testres*)              ;; Display statistics from the result of the test, which have been  stored in *testres* 

Tree Augmented Naive Bayes Classifier:

Just as above, except the call to construct the structure is:
(tanbn)            ;; Construct TAN structure -- result is *tanstruct* 
Then use *tanstruct* rather than *naive* in subsequent function calls, e.g.
(probs *tanstruct*) 
Note that this uses whatever the currently-specified metric is. If you want to construct a TAN classifer using Friedman et al's original Conditional Mutual Information metric, either set *metric* to CMI or use the (FGGTAN) convenience funtion.

General Bayesian Network Structure:

K2 is used to construct General Bayesian Networks (GBNs). The call to construct the structure is:
(k2 99)            ;; Construct K2 structure -- result is *k2struct* 
where the function parameter is the maximum number of parents any node may have.

Then use *k2struct* in subsequent function calls, e.g.

(probs *k2struct*) 
Incidentally, on the very small sample weather dataset, K2 does not work well.

Note that you can use this function to create a Bayesian Network augmented Naive Bayes structure, by adding an additional parameter with value T:

(k2 99 T)            ;; Start with a Naive Bayes structure and augment it with K2 -- result is *k2struct* 

Other Features of the Code

To time programs:

(tstart)    ;; put this call in before constructing structure and calculating probs (tstop)     ;; displays elapsed real time and CPU time, calls to g-function 
To display a confusion matrix: 
(confuse *naive* *probs* *test*) 

To generate a graphical display of the network:

(dot *k2struct* "filestem") 
This function writes out a DOT file to filestem.dot, which can be displayed on screen (from your command line, not in Lisp) using
	dotty filestem.dot 
or converted to PS (or other format, eg PNG) using
	dot -Tps filestem.dot -o filestem.ps 
Important: If using DOT, you cannot have '-' characters in your node names.

To generate data for a ROC curve:

(roc structure probabilities testdata classindex filename)
This function writes to filename the (x,y) points of a ROC curve with 200 evenly-spaced interpolated points.

For multiclass problems, the ROC curve is based on one class only: classindex is the index (counted from 0) of the classification variable category for which the ROC curve is to be generated.

Supporting Programs

I have written some Perl programs to facilitate performing analyses with the BN software. You can use them as they are or modify them to suit your own needs.

If you run any of these programs without arguments, they display a usage message. If you run them with a -v switch, they display verbose messages.

To convert files from MLC++ format

You can download some well-known datasets from the MLC++ web-page. MLC++ files are organised as follows: filestem.names describes the data, filestem.data contains training data, filestem.test contains testing data, and filestem.all contains all data, if using a procedure that will automatically split into training and testing sets.

There are three conversion programs:

name2bn
name2bn opt:-v filestem
Converts filestem.names into defines.lsp
data2bn
name2bn opt:-v filestem.all training-percentage rand-seed
OR:
name2bn opt:-v filestem.data
Converts MLC++ data files into data.lsp.
In the first form, you specify filestem.all and the program divides it into training and testing data, according to the training-percentage and random number seed that you specify. In the second form, you specify filestem.data but filestem.test must also exist, and the program takes the training and testing data from these.
mlc2bn
mlc2bn opt:-v opt:-n opt:-d filestem train-percentage rand-seed
Without either the -n or -d switch, this operates like the first form of data2bn above.
With -n, it also calls name2bn to produce defines.lsp.
With -d, it calls the MLC++ disretize program to discretize the data. This is useful if the data files contain continuous-valued data, as the BN implementation cannot handle it. It is simplest to use -d always, as it doesn't hurt the data files if all variables are already discrete/categorical, but the option is there in case discretize is not installed.

Note: To change the discretization algorithm, edit the program and change $discopts.

To perform a set of analyses for a learning curve

The program lcurve performs a set of analyses and stores raw data from which a learning curve can be plotted for a given dataset and algorithm. Usage:
	lcurve datafile_stem alg_name 
Here, datafile_stem is the name of a dataset in MLC++ format (.names and .all files must exist), and alg_name is one of naive, tan, or k2.

This program uses the mlc2bn and name2bn programs already described, and optionally the MLC++ discretize program. It operates as follows:

  • Repeat for several training set sizes:
    • Repeat N times:
      • Make training/testing data file
      • In LISP, run the training algorithm and test it
      • Store the run details and individual results
    • End repeat
  • End repeat
To specify the training set sizes and how many times the procedure is repeated for each size, you must edit the file and change the parameters in its clearly-marked "Configuration Variables" section. In the same section, you can specify the path of the lisp code. the command to run your Lisp interpreter, and whether to discretize the data.

The program writes out files called alg_name.csv and alg_name.log. The CSV file lists, for each run, the training set size, correct, incorrect and 'unknown' classifications (unknown = equal probability for all classes). You just need to average the sets of results to plot a learning curve in your favourite graphing package. The LOG file essentially contains an echo of the information displayed on screen.

 
New PhD scholarships - Sept 2014

September 2014:

We have 1 or 2 extra PhD scholarships, including a stipend of up to €13,000 tax free per annum + fees + some travel money, available for highly qualified applicants. 

If you are interested in working with me in the Data Mining and Machine Learning Research Group, please contact me by email for details.

Applicants should be available to start by October 2014. Applicants should have:

  • A BSc or MSc in Computer Science or a closely related discipline, with a track record of scholarship that indicates excellent potential for a PhD research
  • Excellent skills in communicating and helping others learn, as these activities will be required as part of the scholarship conditions
  • Evidence of strong programming abilities and comfort with mathematical concepts.

Your research can be in any topic related to my research group's interests. This link has some research topic ideas.

For these scholarships, there will be an evaluation process involving shortlisting based on your background, an interview, and an assessment of your ability to tutor students in multiple programming languages. 

Deadline: Contact me as soon as possible. Applications must be received by mid September.

 
Research Overview

The Data Mining and Machine Learning Group is led by Michael Madden of the College of Engineering & Informatics, NUI Galway.

Our research is focused on new theoretical advances in machine learning and data mining, motivated by important practical applications, on the basis that challenging applications foster novel algorithms which in turn enable new applications.

Specific research topics include:

  • Artificial intelligence, data mining & machine learning
  • Algorithms for classification and numeric prediction
  • New methods for combining domain knowledge with data mining
  • Time series data analysis
  • Probability, reasoning under uncertainty, and Bayesian networks
  • Reinforcement learning
  • Practical applications of data mining and machine learning in science, engineering & medicine.

For more specific information, take a look at our current and recent research projects, our publications, and the current and past members of our group. You can also contact me with questions.

 
Nonlinearity and Uncertainty in Drug Modelling

Project Overview

This is a strongly interdisciplinary collaborative project, involving Informatics, Mathematics and Applied Mathematics, with an application in Medical Informatics. The project collaborators include researchers in Mathematics and Applied Mathematics in NUI Galway, the Intensive Care Unit at University Hospital Galway, as well as Prof. Stuart Russell’s group in Computer Science in the University of California at Berkeley.

In an ICU ward, computer systems constantly monitor and record data such as changes in heart rate and blood pressure. This project will develop software techniques that uses this data, and will add new mathematics and computing techniques that infer patient response to certain drugs in real-time, to predict the impact on the patient of different drug regimes. This will allow for drug treatments to be tailored to the individual. This contrasts with the current practice of basing treatments on standardised data, taken from clinical trials using healthy volunteers, who might not respond in the same way as sick patients. 

Read more...
 

© M Madden