<html>


<!-- Mirrored from www.jenner.ac.uk/YBF/tan.htm by HTTrack Website Copier/3.x [XR&CO'2003], Fri, 25 Jun 2004 08:29:52 GMT -->
<head>
<meta name="GENERATOR" content="Microsoft FrontPage 5.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta http-equiv="Content-Language" content="en-us">
<title>Multi-class protein fold classification using an integrative machine learning approach</title>
</head>

<body>

<div align="center">
  <center>
  <table border="0" cellpadding="0" cellspacing="8" width="98%">
    <tr>
      <td align="right" valign="top" width="20%">&nbsp; </td>
      <td width="15"></td>
      <td bgcolor="lightblue" valign="bottom" width="80%">
      <p align="center"><b>
      <font FACE="Times New Roman" size="5">Multi-class protein fold
classification using an integrative machine learning
approach</font></b></td>
    </tr>
    <tr>
      <td bgcolor="lightblue" valign="top" width="20%">
      <font FACE="Times New Roman">
<p ALIGN="center"><font size="4">Aik Choon Tan&nbsp;</font></p>
<p ALIGN="LEFT">
<b>
Supervisor:</b>  Prof David Gilbert</p>
<p><b>School:</b>  Bioinformatics Research Centre, Department of Computing Science,
University of Glasgow.</p>
</font>
      <p><font size="3"><br>
      <br>
      </font></td>
      <td width="15"></td>
      <td valign="top" width="80%"><font FACE="Times New Roman">
<p ALIGN="LEFT">&nbsp;</p>
<p ALIGN="LEFT">One of the current research trends in machine learning applied
to bioinformatics is to combine several sophisticated learning algorithms in
order to increase a classifier’s predictive accuracy (credibility) and its
explanatory power (comprehensibility). When trying to learn from large and
diverse data sets (e.g. biological databases) it is important to produce
hypotheses that encapsulate all the information from different sources. The
classifiers that are used to characterise and/or classify the data must be
accurate and easily understood by the human expert. Most methods in
bioinformatics concentrate on the classifier’s credibility and seldom emphasise</font> <font FACE="Times New Roman">its
comprehensibility.</p>
<p ALIGN="LEFT">For some multi-class classification problems (e.g. C<sub>1</sub>,
C<sub>2</sub>, C<sub>3</sub>, …, C<sub>n</sub>), the set of positive examples
(C<sub>1</sub>) is very small compared to the set of negative examples
(C<sub>2</sub>, C<sub>3</sub>, …, C<sub>n</sub>). This is a common scenario in
the functional annotation problem, where there are many classes but the number
of examples (proteins) in each class is relatively low. This imbalance between
classes contributes to the poor performance of standard machine learning
techniques (e.g. decision trees).</p>
<p ALIGN="LEFT">When learning on problems of this type, existing machine
learning approaches tend to produce a strongly discriminating classifier (high
accuracy) that nevertheless has very low sensitivity (also called completeness).</p>
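<p ALIGN="LEFT">A small worked example may make the gap between accuracy and
sensitivity concrete. The sketch below assumes a one-vs-rest setting with 24
positive and 576 negative examples (an even split of 600 examples over 25
folds, which is an assumption and not a figure from this work), and a
hypothetical classifier that recovers only 3 of the positives. The function
name <code>metrics</code> and all counts are illustrative.</p>
<pre>
```python
# Illustrative one-vs-rest metrics for a single, heavily outnumbered class.
# Assumption: 24 positives and 576 negatives; these counts are made up
# for illustration and are not taken from the study.

def metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # share of true positives that are found
    specificity = tn / (tn + fp)   # share of true negatives that are kept
    return accuracy, sensitivity, specificity

# Hypothetical classifier: finds 3 of 24 positives, raises no false alarms.
acc, sens, spec = metrics(tp=3, fn=21, tn=576, fp=0)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```
</pre>
<p ALIGN="LEFT">Under these assumed counts the accuracy is 96.5% while the
sensitivity is only 12.5%, which is why accuracy alone can be misleading on
imbalanced classes.</p>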
<p ALIGN="LEFT">The aim of this research is to develop a novel approach for
integrating rules/patterns</font> <font FACE="Times New Roman">induced from
multi-class and unbalanced data sets, and to demonstrate its usefulness on
biological data. Specifically, this method has been designed to increase the
sensitivity of the classifiers. However, one consequence of this approach is a
decrease in the classifiers’ specificity.</p>
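<p ALIGN="LEFT">The trade-off described above can be illustrated with one
simple integration scheme: predicting a class whenever any of several rule
sets predicts it (a logical OR). This is an assumed scheme chosen only for
illustration and is not necessarily the integration method developed in this
research. The names <code>union_vote</code>, <code>rule_a</code> and
<code>rule_b</code>, and the toy labels, are all hypothetical.</p>
<pre>
```python
# Hypothetical sketch: combine rule-based classifiers for one class by
# taking the union (logical OR) of their positive predictions.
# Assumption: this is NOT the method from the abstract, only an example
# of how integration can raise sensitivity and lower specificity.

def union_vote(predictions):
    """Predict positive wherever at least one classifier predicts positive."""
    return [any(votes) for votes in zip(*predictions)]

def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for boolean label lists."""
    tp = sum(t and p for t, p in zip(truth, predicted))
    tn = sum((not t) and (not p) for t, p in zip(truth, predicted))
    positives = sum(truth)
    negatives = len(truth) - positives
    return tp / positives, tn / negatives

# Toy data (made up): 2 positive and 8 negative examples.
truth = [True, True] + [False] * 8
rule_a = [True, False] + [False] * 8          # finds the first positive
rule_b = [False, True, True] + [False] * 7    # finds the second, one false alarm

combined = union_vote([rule_a, rule_b])
for name, pred in [("rule_a", rule_a), ("rule_b", rule_b), ("union", combined)]:
    sens, spec = sensitivity_specificity(truth, pred)
    print(f"{name}: sensitivity={sens:.2f} specificity={spec:.3f}")
```
</pre>
<p ALIGN="LEFT">On this toy data the union finds both positives, where each
rule set alone finds one, but it also inherits the false alarm from
<code>rule_b</code>.</p>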
<p ALIGN="LEFT">We have applied this method to multi-class protein fold
classification. The data set</font> <font FACE="Times New Roman">contains 600
examples for 25 SCOP folds. We showed that this approach is useful when the
ratio of positive to negative examples is very low and the initial
classifiers have low sensitivity. In this case, the loss of specificity is
small compared to the increase in sensitivity, yielding more useful classifiers.
We are now working on improving the specificity of the integrated classifiers.</font></td>
    </tr>
  </table>
  </center>
</div>

</body>


</html>