Graduate student, Kansas State University, United States
Abstract: Protein solubility is a critical attribute that directly affects protein functionality and key characteristics of food products. Accurate solubility prediction is therefore crucial for screening suitable candidates for food applications. Existing models often rely on sequences alone, overlooking important structural details. In this study, a regression model for protein solubility was developed using both the sequences and the predicted structures of 2983 E. coli proteins. Sequence-level and structure-level properties of the proteins were extracted bioinformatically and used to train a multilayer perceptron. In addition, residue-level features and contact maps were used to construct a graph convolutional network. The out-of-fold predictions of the two models were combined and fed into multiple meta-regressors to create a stacking model. The stacking model with a support vector regressor achieved R² values of 0.502 and 0.468 on the test and external validation datasets, respectively, outperforming existing regression models. The stacking model also outperformed other models in a binary classification task (0.805 vs ~0.787 accuracy). Its improved performance over its base models indicates that the stacking model effectively captured the strengths of those models as well as the significance of the different features used. Furthermore, the model's transferability was indirectly validated on a dataset of seed storage proteins using the Osborne definition, as well as in a case study using molecular dynamics simulation, showing potential for application beyond microbial proteins to food- and agriculture-related ones.
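The stacking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the two base models are single-feature linear regressors standing in for the multilayer perceptron and the graph convolutional network, a plain least-squares meta-regressor stands in for the support vector regressor, and the choice of 5 folds is an assumption. All function names (`fit_line`, `out_of_fold`, `fit_meta`, `r2_score`, `make_data`) are hypothetical helpers introduced only for illustration.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b on a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def predict_line(model, xs):
    a, b = model
    return [a * x + b for x in xs]


def out_of_fold(xs, ys, k=5, seed=0):
    """Predict every training sample with a model fitted without it."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    oof = [0.0] * len(xs)
    for f in range(k):
        fold = idx[f::k]                      # samples held out in this fold
        held = set(fold)
        train = [i for i in idx if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            oof[i] = predict_line(model, [xs[i]])[0]
    return oof


def fit_meta(p1, p2, ys):
    """Least-squares meta-regressor y = w1*p1 + w2*p2 + c (stand-in for the SVR)."""
    m1, m2, my = mean(p1), mean(p2), mean(ys)
    s11 = sum((a - m1) ** 2 for a in p1)
    s22 = sum((b - m2) ** 2 for b in p2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(p1, p2))
    s1y = sum((a - m1) * (y - my) for a, y in zip(p1, ys))
    s2y = sum((b - m2) * (y - my) for b, y in zip(p2, ys))
    det = s11 * s22 - s12 ** 2                # 2x2 normal equations, Cramer's rule
    w1 = (s1y * s22 - s2y * s12) / det
    w2 = (s2y * s11 - s1y * s12) / det
    return w1, w2, my - w1 * m1 - w2 * m2


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = mean(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - my) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def make_data(n, rng):
    """Synthetic target that depends on two features (illustrative only)."""
    x1 = [rng.random() for _ in range(n)]
    x2 = [rng.random() for _ in range(n)]
    y = [0.5 * a + 0.3 * b + rng.gauss(0, 0.05) for a, b in zip(x1, x2)]
    return x1, x2, y


rng = random.Random(42)
x1_tr, x2_tr, y_tr = make_data(200, rng)
x1_te, x2_te, y_te = make_data(100, rng)

# Level 0: out-of-fold predictions of both base models on the training set.
oof1 = out_of_fold(x1_tr, y_tr)
oof2 = out_of_fold(x2_tr, y_tr)

# Level 1: the meta-regressor is trained only on out-of-fold predictions.
w1, w2, c = fit_meta(oof1, oof2, y_tr)

# At test time, base models refitted on all training data feed the meta-regressor.
p1_te = predict_line(fit_line(x1_tr, y_tr), x1_te)
p2_te = predict_line(fit_line(x2_tr, y_tr), x2_te)
stacked = [w1 * a + w2 * b + c for a, b in zip(p1_te, p2_te)]

r2_base1 = r2_score(y_te, p1_te)
r2_base2 = r2_score(y_te, p2_te)
r2_stack = r2_score(y_te, stacked)
print(f"R2 base 1: {r2_base1:.3f}, base 2: {r2_base2:.3f}, stacked: {r2_stack:.3f}")
```

Training the meta-regressor on out-of-fold predictions, rather than on predictions the base models made for samples they were fitted on, keeps the second level from learning the base models' overfitting. In this toy setting each base model sees only one of the two informative features, so the stacked prediction benefits from combining them, which mirrors the abstract's point about complementary feature sets.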