Abstract: Ubiquitous microorganisms are key players in biogeochemical cycles and environmental evolution, and they are involved in environmental monitoring as well as ecological governance and protection. Rapidly developing high-throughput technologies have generated massive amounts of microbial data and expanded the scope of microbiome research. Constructing machine learning models to analyze complex microbial data is of great importance for microbial marker identification, pollutant prediction, and environmental quality prediction. Machine learning algorithms can be classified into two categories: supervised learning and unsupervised learning. In microbiome research, unsupervised learning captures the characteristics of the input data through clustering and dimensionality reduction, enabling the integration and classification of microbial data. Supervised learning uses microbial datasets with features and labels to train models that can classify, identify, and predict new, unlabeled data. However, sophisticated machine learning algorithms often pursue prediction accuracy at the expense of interpretability. Such models are often regarded as a "black box" that predicts a specific outcome, with little known about how the model arrives at that prediction. Improving model interpretability is critical for the accurate application of machine learning and the extraction of valuable biological information in microbiome research. This review introduces the machine learning algorithms commonly used in environmental microbiology and the steps for constructing machine learning models from microbiome data (feature selection, algorithm selection, model construction, and evaluation).
Furthermore, we summarize several application scenarios of machine learning models in environmental microbiology to explore in depth the relationship between the microbiome and the surrounding environment, with the aim of improving model interpretability and providing a reference for future environmental monitoring and environmental health prediction.