Probabilistic Dependency Modeling Toolkit
Leo Lahti*¹ and Olli-Pekka Huovilainen¹

(1) Department of Information and Computer Science, Aalto University, Finland.

Analysis of statistical dependencies between co-occurring observations allows the discovery of regularities and interactions that are not seen in individual data sets. Demand for such methods is increasing with the availability of co-occurrence data in many fields, including computational biology, open data initiatives, and other domains. Convenient open access implementations help to realize the full potential of these information sources.

Algorithms

Probabilistic versions of PCA [1], factor analysis [2], and CCA [3-4] are obtained as special cases of a general latent variable framework for dependency modeling (see the vignette for details). The probabilistic framework deals rigorously with the uncertainties associated with small sample sizes and allows prior information to be incorporated into the analysis. Further tools are available for regularized dependency detection [5-6] and dimensionality reduction [8].
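As an illustration of one special case, the following is a minimal sketch of maximum-likelihood probabilistic PCA using the closed-form solution of Tipping and Bishop [1]. It assumes the model x = W z + e with z ~ N(0, I) and isotropic noise e ~ N(0, sigma2 * I). The code is written in Python with NumPy and is independent of the toolkit's R implementation; the function name ppca_ml and the simulated data are hypothetical and serve only this example.

```python
import numpy as np


def ppca_ml(X, q):
    """Closed-form ML estimate for probabilistic PCA (Tipping and Bishop [1]).

    X : (n_samples, d) data matrix
    q : latent dimensionality, with q < d
    Returns the (d, q) loading matrix W and the isotropic noise variance sigma2.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                 # center the data
    S = Xc.T @ Xc / n                       # ML sample covariance
    evals, evecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]         # reorder to descending
    evals, evecs = evals[order], evecs[:, order]
    sigma2 = evals[q:].mean()               # mean of the discarded eigenvalues
    # W = U_q (Lambda_q - sigma2 I)^(1/2), with the arbitrary rotation set to I
    W = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))
    return W, sigma2


# Hypothetical simulated data: 2 latent factors observed in 5 dimensions.
rng = np.random.default_rng(0)
n, d, q = 500, 5, 2
W_true = rng.normal(size=(d, q))
Z = rng.normal(size=(n, q))
X = Z @ W_true.T + 0.5 * rng.normal(size=(n, d))

W, sigma2 = ppca_ml(X, q)
C_model = W @ W.T + sigma2 * np.eye(d)      # covariance implied by the fitted model
print("estimated noise variance:", sigma2)
print("loading matrix shape:", W.shape)
```

The fitted model covariance W W^T + sigma2 * I preserves the total variance of the sample covariance. Replacing the isotropic noise with a diagonal covariance gives factor analysis [2], and using two data sets with a shared latent variable and data-set-specific full noise covariances gives probabilistic CCA [3]; those cases generally need iterative estimation and are not covered by this sketch.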

Test runs for many models are available here, including example scripts and performance statistics.

Applicability of the models has been demonstrated in previous case studies [5-8].

System requirements

The techniques for the discovery and analysis of statistical dependencies are implemented in R, an open source environment for statistical computing. R is available for all major platforms, including Linux, Mac, and Windows.

Installation

See the package vignette for installation instructions and an overview of the functionality. The source code is available from the project page at R-Forge. For application tools for data integration in functional genomics, see the pint Bioconductor package.

Licensing terms

Licensed under the FreeBSD open source license.

Acknowledgements

Authors: Leo Lahti and Olli-Pekka Huovilainen.

Contributors: Arto Klami, Abhishek Tripathi.

The authors are associated with the Statistical Machine Learning and Bioinformatics group at the Department of Information and Computer Science, Aalto University, Finland.

Your feedback and contributions are welcome. See the project page at R-Forge, or contact the project admin.

References

  1. M.E. Tipping and C.M. Bishop (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society B 61(3):611-622.
  2. D.B. Rubin and D.T. Thayer (1982). EM algorithms for ML factor analysis. Psychometrika 47(1):69-76.
  3. F. Bach and M. Jordan (2005). A probabilistic interpretation of canonical correlation analysis. Technical report. Department of Statistics, University of California, Berkeley.
  4. C. Archambeau, N. Delannay, and M. Verleysen (2006). Robust probabilistic projections. In W.W. Cohen and A. Moore, editors, Proc. 23rd Int'l Conference on Machine Learning (ICML'06), p. 33-40. ACM.
  5. L. Lahti, S. Myllykangas, S. Knuutila, and S. Kaski (2009). Dependency detection with similarity constraints. In Proc. MLSP'09 IEEE Int'l Workshop on Machine Learning for Signal Processing, p. 89-94. IEEE, Piscataway, NJ.
  6. L. Lahti (2010). Probabilistic analysis of the human transcriptome with side information. PhD thesis, Aalto University School of Science and Technology, Faculty of Information and Natural Sciences, Espoo, Finland.
  7. L. Lahti et al. (2010). Probabilistic dependency modeling toolkit. International Conference on Machine Learning (ICML-2010), Workshop on Machine Learning Open Source Software. Haifa, Israel, June 2010.
  8. A. Tripathi, A. Klami, and S. Kaski (2008). Simple integrative preprocessing preserves what is shared in data sources. BMC Bioinformatics 9:111.