PMID- 16723004
OWN - NLM
STAT- MEDLINE
DCOM- 20060630
LR  - 20240413
IS  - 1471-2105 (Electronic)
IS  - 1471-2105 (Linking)
VI  - 7 Suppl 1
IP  - Suppl 1
DP  - 2006 Mar 20
TI  - A regression-based K nearest neighbor algorithm for gene function prediction from 
      heterogeneous data.
PG  - S11
AB  - BACKGROUND: As a variety of functional genomic and proteomic techniques become 
      available, there is an increasing need for functional analysis methodologies that 
      integrate heterogeneous data sources. METHODS: In this paper, we address this 
      issue by proposing a general framework for gene function prediction based on the 
      k-nearest-neighbor (KNN) algorithm. The choice of KNN is motivated by its 
      simplicity, flexibility to incorporate different data types and adaptability to 
      irregular feature spaces. A weakness of traditional KNN methods, especially when 
      handling heterogeneous data, is that performance is subject to the often ad hoc 
      choice of similarity metric. To address this weakness, we apply regression 
      methods to infer a similarity metric as a weighted combination of a set of base 
      similarity measures, which helps to locate the neighbors that are most likely to 
      be in the same class as the target gene. We also suggest a novel voting scheme to 
      generate confidence scores that estimate the accuracy of predictions. The method 
      gracefully extends to multi-way classification problems. RESULTS: We apply this 
      technique to gene function prediction according to three well-known Escherichia 
      coli classification schemes suggested by biologists, using information derived 
      from microarray and genome sequencing data. We demonstrate that our algorithm 
      dramatically outperforms the naive KNN methods and is competitive with support 
      vector machine (SVM) algorithms for integrating heterogeneous data. We also show 
      that by combining different data sources, prediction accuracy can improve 
      significantly. CONCLUSION: Our extension of KNN with automatic feature weighting, 
      multi-class prediction, and probabilistic inference enhances prediction accuracy 
      significantly while remaining efficient, intuitive and flexible. This general 
      framework can also be applied to similar classification problems involving 
      heterogeneous datasets.
FAU - Yao, Zizhen
AU  - Yao Z
AD  - Department of Computer Science and Engineering, AC101 Paul G. Allen Center, 
      University of Washington, Seattle WA 98195, USA. yzizhen@cs.washington.edu
FAU - Ruzzo, Walter L
AU  - Ruzzo WL
LA  - eng
PT  - Journal Article
DEP - 20060320
PL  - England
TA  - BMC Bioinformatics
JT  - BMC bioinformatics
JID - 100965194
RN  - 0 (Escherichia coli Proteins)
SB  - IM
MH  - Algorithms
MH  - Artificial Intelligence
MH  - Cluster Analysis
MH  - Computational Biology/*methods
MH  - Computer Simulation
MH  - Escherichia coli Proteins/chemistry
MH  - *Gene Expression Regulation
MH  - *Genes, Bacterial
MH  - Genome, Bacterial
MH  - Models, Genetic
MH  - Neural Networks, Computer
MH  - Oligonucleotide Array Sequence Analysis
MH  - Pattern Recognition, Automated
MH  - Probability
MH  - Regression Analysis
MH  - Reproducibility of Results
MH  - Sequence Analysis, Protein
PMC - PMC1810312
EDAT- 2006/05/26 09:00
MHDA- 2006/07/01 09:00
PMCR- 2006/03/20
CRDT- 2006/05/26 09:00
PHST- 2006/05/26 09:00 [pubmed]
PHST- 2006/07/01 09:00 [medline]
PHST- 2006/05/26 09:00 [entrez]
PHST- 2006/03/20 00:00 [pmc-release]
AID - 1471-2105-7-S1-S11 [pii]
AID - 10.1186/1471-2105-7-S1-S11 [doi]
PST - epublish
SO  - BMC Bioinformatics. 2006 Mar 20;7 Suppl 1(Suppl 1):S11. doi: 
      10.1186/1471-2105-7-S1-S11.