Summary: | The type VI secretion system (T6SS) delivers effector proteins (type VI secretion system effectors, T6SEs) into neighboring target cells. Many human pathogens express T6SEs, including Vibrio cholerae, Burkholderia spp., and Pseudomonas aeruginosa. T6SEs play vital roles in the competitive survival and pathogenesis of bacterial populations. Several machine-learning methods can already distinguish T6SEs from non-T6SEs, but there is still room for improvement. We therefore propose a more powerful ensemble predictor for identifying T6SEs. First, we construct a benchmark dataset from existing studies and databases. Next, we use the $k$-separated-bigrams-PSSM feature encoding to convert the protein sequences into numerical vectors. We then apply the synthetic minority oversampling technique (SMOTE) to address class imbalance in the training data. Finally, we use a soft voting strategy to build an integrated model that combines six fine-tuned base classifiers. In independent testing, the proposed model achieves an accuracy (ACC) of 99.0%, a Matthews correlation coefficient (MCC) of 97.8%, a sensitivity (SN) of 97.1%, and a specificity (SP) of 100%.
|
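
As a rough illustration of two steps named in the summary, the sketch below computes a $k$-separated-bigrams feature vector from a position-specific scoring matrix (PSSM) and combines classifier outputs by soft voting. It assumes the commonly used definition $T_{m,n}(k) = \sum_{i=1}^{L-k} P_{i,m} P_{i+k,n}$, where $P$ is the $L \times 20$ PSSM. The function names `k_separated_bigrams` and `soft_vote` are hypothetical, and the PSSM normalization, the values of $k$, and the base classifiers used in the actual study are not specified here.

```python
from typing import List, Sequence


def k_separated_bigrams(pssm: Sequence[Sequence[float]], k: int = 1) -> List[float]:
    """Flattened (row-major) matrix T with T[m][n] = sum_i P[i][m] * P[i+k][n].

    `pssm` has one row per residue; a real PSSM has 20 columns, which
    yields a 400-dimensional vector. The column count is read from the input.
    """
    length = len(pssm)
    width = len(pssm[0])
    features = [0.0] * (width * width)
    # Pair every position i with the position k residues downstream.
    for i in range(length - k):
        row, later = pssm[i], pssm[i + k]
        for m in range(width):
            for n in range(width):
                features[m * width + n] += row[m] * later[n]
    return features


def soft_vote(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class probabilities predicted by several classifiers."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    return [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]


# Toy 3-residue, 2-column matrix standing in for a real L x 20 PSSM.
toy_pssm = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_features = k_separated_bigrams(toy_pssm, k=1)

# Hypothetical [P(non-T6SE), P(T6SE)] outputs from three base classifiers.
toy_vote = soft_vote([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
```

Soft voting predicts the class with the highest averaged probability, so a confident classifier can outweigh several uncertain ones, unlike majority (hard) voting. Oversampling such as SMOTE would be applied to the training feature vectors only, before the base classifiers are fitted.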