PMID- 17930123
OWN - NLM
STAT- PubMed-not-MEDLINE
DCOM- 20080206
LR  - 20220310
IS  - 1539-3755 (Print)
IS  - 1539-3755 (Linking)
VI  - 76
IP  - 2 Pt 2
DP  - 2007 Aug
TI  - Relative performance of mutual information estimation methods for
      quantifying the dependence among short and noisy data.
PG  - 026209
AB  - Commonly used dependence measures, such as linear correlation, the
      cross-correlogram, or Kendall's tau, cannot capture the complete dependence
      structure in data unless the structure is restricted to linear, periodic,
      or monotonic. Mutual information (MI) has been frequently utilized for
      capturing the complete dependence structure, including nonlinear
      dependence. Recently, several methods have been proposed for MI estimation,
      such as kernel density estimators (KDEs), k-nearest neighbors (KNNs),
      Edgeworth approximation of differential entropy, and adaptive partitioning
      of the XY plane. However, outstanding gaps in the current literature have
      precluded the ability to effectively automate these methods, which, in
      turn, has limited their adoption by the application communities. This study
      attempts to address a key gap in the literature, specifically the
      evaluation of the above methods to choose the best method, particularly in
      terms of their robustness for short and noisy data, based on comparisons
      with the theoretical MI estimates, which can be computed analytically, as
      well as with linear correlation and Kendall's tau. Here we consider smaller
      data sizes, such as 50, 100, and 1000, and within this study we
      characterize 50 and 100 data points as very short and 1000 as short. We
      consider a broad class of functions, specifically linear, quadratic,
      periodic, and chaotic, contaminated with artificial noise at varying
      noise-to-signal ratios. Our results indicate that KDEs are the best choice
      for very short data at relatively high noise-to-signal levels, whereas KNNs
      perform best for very short data at relatively low noise levels as well as
      for short data consistently across noise levels. In addition, the optimal
      smoothing parameter of a Gaussian kernel appears to be the best choice for
      KDEs, while three nearest neighbors appear optimal for KNNs. Thus, in
      situations where the approximate data sizes are known in advance and
      exploratory data analysis and/or domain knowledge can provide a priori
      insights into the noise-to-signal ratios, the results in this paper point
      to a way forward for automating the process of MI estimation.
FAU - Khan, Shiraj
AU  - Khan S
AD  - Computational Sciences and Engineering, Oak Ridge National Laboratory, Oak
      Ridge, Tennessee 37831, USA.
FAU - Bandyopadhyay, Sharba
AU  - Bandyopadhyay S
FAU - Ganguly, Auroop R
AU  - Ganguly AR
FAU - Saigal, Sunil
AU  - Saigal S
FAU - Erickson, David J 3rd
AU  - Erickson DJ 3rd
FAU - Protopopescu, Vladimir
AU  - Protopopescu V
FAU - Ostrouchov, George
AU  - Ostrouchov G
LA  - eng
PT  - Journal Article
DEP - 20070814
PL  - United States
TA  - Phys Rev E Stat Nonlin Soft Matter Phys
JT  - Physical review. E, Statistical, nonlinear, and soft matter physics
JID - 101136452
EDAT- 2007/10/13 09:00
MHDA- 2007/10/13 09:01
CRDT- 2007/10/13 09:00
PHST- 2007/02/06 00:00 [received]
PHST- 2007/05/17 00:00 [revised]
PHST- 2007/10/13 09:00 [pubmed]
PHST- 2007/10/13 09:01 [medline]
PHST- 2007/10/13 09:00 [entrez]
AID - 10.1103/PhysRevE.76.026209 [doi]
PST - ppublish
SO  - Phys Rev E Stat Nonlin Soft Matter Phys. 2007 Aug;76(2 Pt 2):026209. doi:
      10.1103/PhysRevE.76.026209. Epub 2007 Aug 14.
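The KNN-based MI estimator with three nearest neighbors, which the abstract identifies as optimal, is conventionally the Kraskov-Stögbauer-Grassberger (KSG) estimator in this literature. Below is a minimal pure-Python sketch of that estimator, assuming the paper's KNN method follows the standard KSG formula I(X;Y) ≈ ψ(k) + ψ(N) − ⟨ψ(n_x+1) + ψ(n_y+1)⟩; the function names `ksg_mi` and `digamma` are our own, not from the paper.

```python
import math
import random


def digamma(x):
    """Digamma function via recurrence plus an asymptotic series
    (accurate to well below 1e-6 for x >= 1, which suffices here)."""
    r = 0.0
    while x < 6:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))


def ksg_mi(xs, ys, k=3):
    """KSG estimate of I(X;Y) in nats, with k=3 as the abstract suggests.

    Brute-force O(n^2) neighbor search, which is acceptable for the
    short series (50-1000 points) the paper targets.
    """
    n = len(xs)
    acc = 0.0
    for i in range(n):
        # Distance to the k-th nearest neighbor in the joint space (max-norm).
        d = sorted(max(abs(xs[i] - xs[j]), abs(ys[i] - ys[j]))
                   for j in range(n) if j != i)
        eps = d[k - 1]
        # Count marginal neighbors strictly within that radius.
        nx = sum(1 for j in range(n) if j != i and abs(xs[i] - xs[j]) < eps)
        ny = sum(1 for j in range(n) if j != i and abs(ys[i] - ys[j]) < eps)
        acc += digamma(nx + 1) + digamma(ny + 1)
    return digamma(k) + digamma(n) - acc / n
```

For a sanity check of the kind the paper's analytical comparisons rely on: a bivariate Gaussian with correlation rho has the closed-form MI of -0.5*ln(1 - rho^2) nats, so the estimate on strongly correlated samples should be clearly positive while independent samples should give a value near zero.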