I think many researchers place too much trust in the completeness assessment scores given by BUSCO. To investigate this matter, we previously analyzed the performance of BUSCO and reported its insufficient sensitivity, using the human 'T2T' genome assembly (see below; Yamaguchi et al., 2021; also see Table 1 of Huang and Heng, 2023). Our paper by Yamaguchi et al. (2021) was briefly covered in one of the earlier blog posts.
The first part of the table listing the 79 BUSCO reference orthologs that an earlier version of BUSCO could not identify in the human 'T2T' genome assembly but that we identified through largely manual inspection.
It seems that BUSCO has adopted miniprot to map reference orthologs (see its official guide), which should enhance its sensitivity in scanning the protein-coding space of eukaryotic genomes.
'As of v5.7.0, Miniprot is the default tool for eukaryotic genome mode. Miniprot is not a gene predictor, but a gene mapper, and uses a reference protein database (provided in the BUSCO datasets) to map proteins to the genome. Miniprot is generally faster than Augustus and Metaeuk (except on fungi, in which Metaeuk is still faster), and typically yields slightly higher completion scores. However, Miniprot may underperform for highly divergent assemblies (with respect to the species in the BUSCO datasets), due to its limited sensitivity to divergent orthologs. ...' (cited from official BUSCO website)
Importantly, the group of Heng Li had already implemented miniprot in compleasm, and I am quite satisfied with the scores given by compleasm. Still, no conclusive answer has been given to the question 'BUSCO or Compleasm?'. This issue should be discussed more carefully together with another aspect of completeness assessment, namely the choice of the reference ortholog set. I actually see merit in choosing reference genes more rigidly, so that no absence is allowed among the species with already sequenced genomes (which is not the case with the current BUSCO gene sets).
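To make the idea of a rigid reference gene choice concrete, here is a minimal sketch in Python. It is only an illustration: the function names, the ortholog group IDs, the species list, and the presence table are all made up for this example and do not come from any BUSCO or compleasm dataset, and the `min_fraction` tolerance is a hypothetical parameter standing in for any criterion that tolerates some absences.

```python
from typing import Dict, Set


def strict_reference_set(presence: Dict[str, Set[str]], species: Set[str]) -> Set[str]:
    """Keep only ortholog groups detected in every species (no absence allowed)."""
    return {og for og, found_in in presence.items() if species <= found_in}


def relaxed_reference_set(
    presence: Dict[str, Set[str]], species: Set[str], min_fraction: float
) -> Set[str]:
    """Keep ortholog groups detected in at least `min_fraction` of the species.

    `min_fraction` is a hypothetical tolerance used here only for contrast.
    """
    return {
        og
        for og, found_in in presence.items()
        if len(found_in & species) / len(species) >= min_fraction
    }


# Toy, invented presence table: ortholog group -> species in which it was found.
species = {"human", "chicken", "zebrafish", "lamprey"}
presence = {
    "OG0001": {"human", "chicken", "zebrafish", "lamprey"},
    "OG0002": {"human", "chicken", "zebrafish"},  # absent in lamprey
    "OG0003": {"human", "chicken", "zebrafish", "lamprey"},
    "OG0004": {"human", "chicken"},  # absent in two species
}

strict = strict_reference_set(presence, species)
relaxed = relaxed_reference_set(presence, species, min_fraction=0.75)

print("strict :", sorted(strict))
print("relaxed:", sorted(relaxed))
```

With the strict rule, a gene missing from even one sequenced species is dropped from the reference set, so a "missing" call in a new assembly is harder to explain away as a genuine lineage-specific loss. The relaxed rule keeps more genes, at the cost of including some that are truly absent in certain lineages.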
Much earlier, we also evaluated CEGMA (!!) and BUSCO, pointed out the systematic underestimation of completeness scores for jawless fishes (hagfish and lamprey), and introduced our original reference ortholog set, CVG (Core Vertebrate Genes) (Hara et al., 2015). In fact, this gene set was based on the rigid criterion of reference gene choice mentioned above. Later, our experience led to the launch of the web server gVolante, with its default and alternative parameter settings, to disseminate more thoughtful practice in the completeness assessment of large-scale sequence datasets. We understand that the functions of gVolante are expected to be updated (for example, to incorporate BUSCO v5.7), but we will somehow maintain it.