Pitfalls of bacterial pan-genome analysis approaches

Bioinformatics, Volume 41, Issue 5, May 2025

pan-genome M. tuberculosis bioinformatics methods bacteria

Abstract

Pan-genome analysis is a fundamental tool for studying bacterial genome evolution; however, the variety in methods used to define and measure the pan-genome poses challenges to the interpretation and reliability of results. Using Mycobacterium tuberculosis, a clonally evolving bacterium with a small accessory genome, as a model system, we systematically evaluated sources of variability in pan-genome estimates. Our analysis revealed that differences in assembly type (short-read versus hybrid), annotation pipeline, and pan-genome software significantly impact predictions of core and accessory genome size. Extending our analysis to two additional bacterial species, Escherichia coli and Staphylococcus aureus, we observed consistent tool-dependent biases but species-specific patterns in pan-genome variability. Our findings highlight the importance of integrating nucleotide- and protein-level analyses to improve the reliability and reproducibility of pan-genome studies across diverse bacterial populations.

Key findings

  • Assembly type (short-read vs. hybrid), annotation pipeline, and choice of pan-genome software are each major sources of variability in pan-genome size estimates
  • Tool-dependent biases are consistent across the three species studied (M. tuberculosis, E. coli, and S. aureus), while patterns of pan-genome variability are species-specific
  • Integrating both nucleotide- and protein-level analyses improves reliability
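To make the terms concrete, here is a minimal sketch of how core and accessory genome sizes can be counted from a gene presence/absence table. It is not the paper's method and not the panqc interface. The function name, the `core_fraction` threshold, and the toy data are all assumptions made for illustration. The second table shows one hypothetical way a pipeline difference could shift the counts: a gene that one pipeline reports as a single cluster is reported as two partial clusters by another.

```python
from typing import Dict, List, Tuple


def core_accessory_sizes(
    presence: Dict[str, List[int]], core_fraction: float = 1.0
) -> Tuple[int, int]:
    """Count core and accessory gene clusters.

    presence maps a gene-cluster name to a 0/1 list with one entry per genome.
    A cluster is counted as core if it is present in at least core_fraction
    of the genomes (an assumed threshold), and as accessory if it is present
    in at least one genome but falls below that threshold.
    """
    core = 0
    accessory = 0
    for calls in presence.values():
        frequency = sum(calls) / len(calls)
        if frequency >= core_fraction:
            core += 1
        elif frequency > 0:
            accessory += 1
    return core, accessory


# Toy data, invented for illustration: four genomes, two hypothetical pipelines.
pipeline_a = {
    "geneA": [1, 1, 1, 1],
    "geneB": [1, 1, 1, 1],
    "geneC": [1, 0, 1, 1],
}

# Hypothetical pipeline B reports geneB as two clusters, each seen in half
# of the genomes, so geneB no longer counts as core.
pipeline_b = {
    "geneA": [1, 1, 1, 1],
    "geneB_1": [1, 1, 0, 0],
    "geneB_2": [0, 0, 1, 1],
    "geneC": [1, 0, 1, 1],
}

size_a = core_accessory_sizes(pipeline_a)
size_b = core_accessory_sizes(pipeline_b)
print("pipeline A (core, accessory):", size_a)
print("pipeline B (core, accessory):", size_b)
```

In this toy case the underlying genomes are identical, yet pipeline A yields 2 core and 1 accessory cluster while pipeline B yields 1 core and 3 accessory clusters. Differences of this kind come from the analysis rather than from the biology.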

Associated software: panqc

This work led to the development of panqc, a simple quality control tool for bacterial pan-genome analyses.

Citation

Maximillian G Marin, Natalia Quinones-Olvera, Christoph Wippel, Mahboobeh Behruznia, Brendan M Jeffrey, Michael Harris, Brendon C Mann, Alex Rosenthal, Karen R Jacobson, Robin M Warren, Heng Li, Conor J Meehan, Maha R Farhat. Pitfalls of bacterial pan-genome analysis approaches: a case study of Mycobacterium tuberculosis and two less clonal bacterial species. Bioinformatics, Volume 41, Issue 5, May 2025, btaf219. https://doi.org/10.1093/bioinformatics/btaf219