Filtering OTU tables

This section presents three important ways of cleaning OTU tables: filtering by taxonomy, by abundance and prevalence, and by BLAST quality.

The two downloadable scripts below (004_global_filter.R and 005_BLAST_filter.R) implement a standard approach for cleaning the OTU table of a phyloseq object: the first filters on taxonomy, abundance and prevalence, and the second on BLAST quality.

For a step-by-step explanation of each filtering procedure, see the sections below:

Global filtering

Global filtering is based on taxonomic assignments, on the abundance of OTUs across all samples (singletons and rare features), or on the prevalence of OTUs across samples (scarcity). It is typically applied early to remove very low-count and low-prevalence features that are disproportionately driven by sequencing/PCR error and index misassignment, thereby reducing noise and improving downstream compositional and multivariate analyses.

Taxonomy filtering

This first, simple step retains only the OTUs that belong to the group targeted by each primer set, i.e. Bacteria for 16S, Fungi for ITS and Metazoa for CO1.
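As an illustration, the taxonomy filter can be sketched in base R on a toy OTU table. This is not the content of 004_global_filter.R: the object names (otu, tax, target_kingdom) and the rank column name "Kingdom" are assumptions made for the example. On a phyloseq object, subset_taxa() serves the same purpose.

```r
# Toy OTU table (OTUs in rows, samples in columns) and matching taxonomy table.
# All names and values here are hypothetical.
otu <- matrix(
  c(10, 0, 3,
     5, 2, 0,
     0, 7, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU1", "OTU2", "OTU3"), c("S1", "S2", "S3"))
)
tax <- data.frame(
  Kingdom = c("Bacteria", "Archaea", "Bacteria"),
  Phylum  = c("Proteobacteria", "Euryarchaeota", NA),
  row.names = rownames(otu),
  stringsAsFactors = FALSE
)

# Keep only OTUs assigned to the kingdom targeted by the primers
# (here Bacteria, as for a 16S dataset). Unassigned OTUs (NA) are dropped.
target_kingdom <- "Bacteria"
keep <- !is.na(tax$Kingdom) & tax$Kingdom == target_kingdom

otu_filtered <- otu[keep, , drop = FALSE]
tax_filtered <- tax[keep, , drop = FALSE]
```

The same logical vector is applied to both tables so that the OTU table and the taxonomy table stay aligned.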

Singletons and rare OTUs

Singleton OTUs have been shown to be overwhelmingly composed of artefacts arising from PCR and sequencing errors, including chimeric molecules that are typically low-abundance and sample-restricted[5,6]. Their removal has repeatedly been shown to have negligible effects on hypothesis-driven community comparisons while substantially reducing noise[1,2,7]. On the other hand, retaining them is justifiable in studies explicitly targeting rare biosphere diversity, but requires careful manual curation[3].

Rare OTUs (beyond singletons) can hardly be distinguished from artefacts, so low-abundance OTUs are often excluded from the analysis. However, filtering approaches vary among studies, and there is no consensus on which filtering thresholds should be used[2,8,9]. Like singletons, rare OTUs often represent residual artefacts, and removing them has minimal influence on community comparisons. Filtering therefore reduces the complexity of microbiome data while preserving their integrity in downstream analyses[1,2,7,10]. Although rare taxa may in some contexts carry information on beta diversity, their retention is mainly justified in studies explicitly focused on rare biosphere ecology, where careful validation is required[3].

Rare OTUs are frequently removed using absolute abundance thresholds (e.g. <5 or <10 reads)[3,7] or relative abundance thresholds (e.g. <0.01%)[2], while some studies filter rare OTUs using statistical methods[11].
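The absolute and relative abundance criteria can be sketched as follows in base R. The toy counts and the threshold values (min_reads, min_rel) are assumptions chosen for illustration, not recommendations. With a phyloseq object, the per-OTU totals would typically come from taxa_sums().

```r
# Toy OTU table (OTUs in rows, samples in columns); values are hypothetical.
otu <- matrix(
  c(500, 300, 200,   # abundant OTU
      1,   0,   0,   # singleton (1 read in total)
      2,   1,   1,   # rare OTU (4 reads in total)
     40,   0,  60),  # moderately abundant OTU
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:4), c("S1", "S2", "S3"))
)

total_reads <- rowSums(otu)            # reads per OTU across all samples
rel_abund   <- total_reads / sum(otu)  # share of all reads in the dataset

min_reads <- 5       # hypothetical absolute threshold (reads)
min_rel   <- 0.0001  # hypothetical relative threshold (0.01%)

keep_no_singletons <- total_reads > 1          # drops singletons only
keep_absolute      <- total_reads >= min_reads # drops singletons and rare OTUs
keep_relative      <- rel_abund >= min_rel     # depends on sequencing depth

otu_abs <- otu[keep_absolute, , drop = FALSE]
```

In this toy table the relative threshold removes nothing: with only 1,105 reads in total, 0.01% corresponds to less than one read. Relative thresholds only start to act on deeply sequenced datasets, which is one reason the two criteria are not interchangeable.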

Scarce OTUs

Scarce (low-prevalence) OTUs are features detected in only a small fraction of samples (e.g. <5%). The extreme sparsity of microbiome data sets poses a major analytical challenge and is largely attributable to contamination and sequencing artefacts or errors[10]. Thus, scarce OTUs are commonly filtered using simple rule-of-thumb criteria (e.g. presence in at least k samples or k% of samples), which reduces noise and improves statistical stability[10]. This criterion is independent of abundance: an OTU with few reads overall (rare) may still be present in most samples (prevalent), and vice versa.

In certain studies, higher thresholds for prevalence (>25% of samples) may be used to define, for instance, the core microbiome[12].
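A prevalence filter can be sketched in base R as below. The toy counts, the minimum number of samples (min_samples) and the core threshold (core_prev) are assumptions for the example. The 25% value echoes the figure quoted above but is not prescriptive.

```r
# Toy OTU table (OTUs in rows, 8 samples in columns); values are hypothetical.
otu <- matrix(
  c(5, 3, 0, 8, 2, 1, 0, 4,    # present in 6 of 8 samples
    0, 0, 9, 0, 0, 0, 0, 0,    # present in 1 of 8 samples
    1, 0, 0, 2, 0, 0, 0, 0),   # present in 2 of 8 samples
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:3), paste0("S", 1:8))
)

n_present  <- rowSums(otu > 0)      # number of samples in which each OTU occurs
prevalence <- n_present / ncol(otu) # fraction of samples

min_samples <- 2     # hypothetical "at least k samples" rule
core_prev   <- 0.25  # hypothetical core-microbiome threshold (>25% of samples)

keep_prevalent <- n_present >= min_samples
is_core        <- prevalence > core_prev

otu_prev <- otu[keep_prevalent, , drop = FALSE]
```

Here OTU2 is removed as scarce even though it has more reads (9) than OTU3 (3), which illustrates that prevalence and abundance are separate criteria.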

BLAST quality check

Filtering based on BLAST accuracy is applied to remove OTUs with low-confidence taxonomic assignments, typically indicated by low sequence similarity (pident), poor alignment coverage (qcovs), or weak statistical support for the match (high evalue)[4]. Such features are frequently artefactual or biologically uninterpretable and can bias ecological inference if retained.

BLAST filtering

Filtering out OTUs based on BLAST results is normally done during bioinformatic processing. If this was not done for your OTU table, simple filters can be used to remove remaining artefacts: for instance, sequences that are too short to be a proper amplicon (e.g. <250 bp, depending on the primers), or sequences with less than 15% query coverage (qcovs), which are assumed to be fully synthetic[13].
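The two filters mentioned above can be sketched in base R on a table of best BLAST hits. This is not the content of 005_BLAST_filter.R: the data frame layout and the column names (qseqid, qlen, pident, qcovs) follow common BLAST tabular output fields but are assumptions here, as are the toy values.

```r
# Toy table of best BLAST hits, one row per OTU; values are hypothetical.
blast <- data.frame(
  qseqid = c("OTU1", "OTU2", "OTU3", "OTU4"),
  qlen   = c(420, 180, 410, 395),      # length of the OTU sequence (bp)
  pident = c(99.1, 97.0, 88.5, 92.3),  # percent identity of the best hit
  qcovs  = c(100, 98, 12, 95),         # query coverage (%)
  stringsAsFactors = FALSE
)

min_len   <- 250  # minimum amplicon length (bp), depends on the primers
min_qcovs <- 15   # minimum query coverage (%)

# Keep OTUs that are long enough AND sufficiently covered by their best hit.
keep       <- blast$qlen >= min_len & blast$qcovs >= min_qcovs
blast_kept <- blast[keep, ]
kept_otus  <- blast_kept$qseqid
```

The resulting vector of OTU names (kept_otus) can then be used to subset the OTU table, for example with prune_taxa() on a phyloseq object.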

BLAST cleaning

After filtering out any remaining potential artefacts, BLAST statistics (such as pident and evalue) can also be used to confirm taxonomic assignments. Objective criteria (such as numerical thresholds) are lacking for the taxonomic assignment of uncultured microorganisms, which are identified only by sequence data. Few studies have attempted to determine universal thresholds for each taxonomic group: some have proposed arbitrary rules of thumb[14], and others have identified thresholds that can be applied in specific cases[5,15–18].
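One way to apply such thresholds is to blank out any taxonomic rank that the percent identity of the best hit does not support. The base R sketch below assumes a taxonomy table with one column per rank plus a pident column. The values in rank_thresholds are purely illustrative placeholders, not taken from the cited studies, and should be replaced with thresholds appropriate to the marker and taxonomic group.

```r
# Hypothetical minimum percent identity required to trust each rank.
rank_thresholds <- c(Phylum = 75, Family = 85, Genus = 90, Species = 97)

# Toy taxonomy table with the pident of each OTU's best BLAST hit.
tax <- data.frame(
  Phylum  = c("Ascomycota", "Ascomycota", "Basidiomycota"),
  Family  = c("Aspergillaceae", "Nectriaceae", "Agaricaceae"),
  Genus   = c("Aspergillus", "Fusarium", "Agaricus"),
  Species = c("Aspergillus niger", "Fusarium oxysporum", "Agaricus bisporus"),
  pident  = c(99.0, 91.5, 80.0),
  row.names = c("OTU1", "OTU2", "OTU3"),
  stringsAsFactors = FALSE
)

# For each rank, set the assignment to NA where pident is below the threshold.
for (rank in names(rank_thresholds)) {
  unsupported <- tax$pident < rank_thresholds[[rank]]
  tax[unsupported, rank] <- NA
}
```

After this step, OTU1 keeps its full assignment, OTU2 is resolved to genus only, and OTU3 to phylum only. Truncating the assignment instead of discarding the OTU keeps its reads in the analysis at the rank that the sequence similarity can support.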

References

1. Sgarbi, L. F., Bini, L. M., Heino, J., Jyrkänkallio-Mikkola, J., Landeiro, V. L., Santos, E. P., Schneck, F., Siqueira, T., Soininen, J., Tolonen, K. T., & Melo, A. S. (2020). Sampling effort and information quality provided by rare and common species in estimating assemblage structure. Ecological Indicators, 110, 105937. https://doi.org/10.1016/j.ecolind.2019.105937
2. Nikodemova, M., Holzhausen, E. A., Deblois, C. L., Barnet, J. H., Peppard, P. E., Suen, G., & Malecki, K. M. (2023). The effect of low-abundance OTU filtering methods on the reliability and variability of microbial composition assessed by 16S rRNA amplicon sequencing. Frontiers in Cellular and Infection Microbiology, 13. https://doi.org/10.3389/fcimb.2023.1165295
3. Brown, S. P., Veach, A. M., Rigdon-Huss, A. R., Grond, K., Lickteig, S. K., Lothamer, K., Oliver, A. K., & Jumpponen, A. (2015). Scraping the bottom of the barrel: are rare high throughput sequences artifacts? Fungal Ecology, 13, 221–225. https://doi.org/10.1016/j.funeco.2014.08.006
4. Tedersoo, L., Bahram, M., Zinger, L., Nilsson, R. H., Kennedy, P. G., Yang, T., Anslan, S., & Mikryukov, V. (2022). Best practices in metabarcoding of fungi: From experimental design to results. Molecular Ecology, 31(10), 2769–2795. https://doi.org/10.1111/mec.16460
5. Tedersoo, L., Mikryukov, V., Anslan, S., Bahram, M., Khalid, A. N., Corrales, A., Agan, A., Vasco-Palacios, A.-M., Saitta, A., Antonelli, A., Rinaldi, A. C., Verbeken, A., Sulistyo, B. P., Tamgnoue, B., Furneaux, B., Ritter, C. D., Nyamukondiwa, C., Sharp, C., Marín, C., … Abarenkov, K. (2021). The Global Soil Mycobiome consortium dataset for boosting fungal diversity research. Fungal Diversity, 111(1), 573–588. https://doi.org/10.1007/s13225-021-00493-7
6. Sze, M. A., & Schloss, P. D. (2019). The Impact of DNA Polymerase and Number of Rounds of Amplification in PCR on 16S rRNA Gene Sequence Data. mSphere, 4(3). https://doi.org/10.1128/msphere.00163-19
7. Edgar, R. C. (2016). UNOISE2: Improved error-correction for Illumina 16S and ITS amplicon sequencing. bioRxiv. https://doi.org/10.1101/081257
8. Yu, Z., Wang, H., Meng, J., Miao, M., Kong, Q., Wang, R., & Liu, J. (2017). Quantifying the responses of biological indices to rare macroinvertebrate taxa exclusion: Does excluding more rare taxa cause more error? Ecology and Evolution, 7(5), 1583–1591. https://doi.org/10.1002/ece3.2798
9. Poos, M. S., & Jackson, D. A. (2012). Addressing the removal of rare species in multivariate bioassessments: The impact of methodological choices. Ecological Indicators, 18, 82–90. https://doi.org/10.1016/j.ecolind.2011.10.008
10. Cao, Q., Sun, X., Rajesh, K., Chalasani, N., Gelow, K., Katz, B., Shah, V. H., Sanyal, A. J., & Smirnova, E. (2021). Effects of rare microbiome taxa filtering on statistical analysis. Frontiers in Microbiology, 11. https://doi.org/10.3389/fmicb.2020.607325
11. Smirnova, E., Huzurbazar, S., & Jafari, F. (2018). PERFect: PERmutation Filtering test for microbiome data. Biostatistics, 20(4), 615–631. https://doi.org/10.1093/biostatistics/kxy020
12. Simonin, M., Dasilva, C., Terzi, V., Ngonkeu, E. L. M., Diouf, D., Kane, A., Béna, G., & Moulin, L. (2020). Influence of plant genotype and soil on the wheat rhizosphere microbiome: evidences for a core microbiome across eight African and European soils. FEMS Microbiology Ecology, 96(6). https://doi.org/10.1093/femsec/fiaa067
13. Kunjapur, A. M., Pfingstag, P., & Thompson, N. C. (2018). Gene synthesis allows biologists to source genes from farther away in the tree of life. Nature Communications, 9(1). https://doi.org/10.1038/s41467-018-06798-7
14. Tedersoo, L., Bahram, M., Põlme, S., Kõljalg, U., Yorou, N. S., Wijesundera, R., Ruiz, L. V., Vasco-Palacios, A. M., Thu, P. Q., Suija, A., Smith, M. E., Sharp, C., Saluveer, E., Saitta, A., Rosas, M., Riit, T., Ratkowsky, D., Pritsch, K., Põldmaa, K., … Abarenkov, K. (2014). Global diversity and geography of soil fungi. Science, 346(6213). https://doi.org/10.1126/science.1256688
15. Yarza, P., Yilmaz, P., Pruesse, E., Glöckner, F. O., Ludwig, W., Schleifer, K.-H., Whitman, W. B., Euzéby, J., Amann, R., & Rosselló-Móra, R. (2014). Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nature Reviews Microbiology, 12(9), 635–645. https://doi.org/10.1038/nrmicro3330
16. Vu, D., Groenewald, M., Vries, M. de, Gehrmann, T., Stielow, B., Eberhardt, U., Al-Hatmi, A., Groenewald, J. Z., Cardinali, G., Houbraken, J., Boekhout, T., Crous, P. W., Robert, V., & Verkley, G. J. M. (2019). Large-scale generation and analysis of filamentous fungal DNA barcodes boosts coverage for kingdom fungi and reveals thresholds for fungal species and higher taxon delimitation. Studies in Mycology, 92(1), 135–154. https://doi.org/10.1016/j.simyco.2018.05.001
17. Pappalardo, P., Hemmi, J. M., Machida, R. J., Leray, M., Collins, A. G., & Osborn, K. J. (2025). Taxon-specific BLAST percent identity thresholds for identification of unknown sequences using metabarcoding. Methods in Ecology and Evolution, 16(10), 2380–2394. https://doi.org/10.1111/2041-210x.70147
18. Ransome, E., Geller, J. B., Timmers, M., Leray, M., Mahardini, A., Sembiring, A., Collins, A. G., & Meyer, C. P. (2017). The importance of standardization for biodiversity comparisons: A case study using autonomous reef monitoring structures (ARMS) and metabarcoding to measure cryptic diversity on Moorea coral reefs, French Polynesia. PLOS ONE, 12(4), e0175066. https://doi.org/10.1371/journal.pone.0175066