Validation of Models
In the present work, we selected 85 key proteins in Alzheimer's disease (AD). For further details about the targets, see the primary citation. For each AD key protein, three different ML models were built with the XGBoost classification and regression algorithm[1] and evaluated with the balanced accuracy, precision, and F1 score metrics, respectively. Balanced accuracy is the arithmetic mean of sensitivity and specificity; it applies to both binary and multi-class classification models and is recommended when dealing with imbalanced data. Precision quantifies the fraction of positive predictions that are actually correct (true positives divided by all predicted positives). F1 score was selected because it integrates precision and recall into a single metric, giving a better overall picture of model performance; it is commonly used as an evaluation metric in binary ML models. These metrics are widely recommended in the literature for studies that require accurate interpretation of model effectiveness across classes[2].
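As a brief illustration of how these three metrics are obtained in practice, the sketch below uses standard scikit-learn calls with made-up labels; it is not the project pipeline itself.

```python
# Minimal sketch: computing the three evaluation metrics with scikit-learn
from sklearn.metrics import balanced_accuracy_score, precision_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = active, 0 = inactive (illustrative labels)
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical model predictions

print(balanced_accuracy_score(y_true, y_pred))  # arithmetic mean of sensitivity and specificity
print(precision_score(y_true, y_pred))          # TP / (TP + FP)
print(f1_score(y_true, y_pred))                 # harmonic mean of precision and recall
```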
As an example, the validation of the ML model built for the BACE1 (beta-secretase 1) protein (UniProt ID: P56817) and evaluated with the precision metric is shown in Figure 1. 2902 active and 1690 inactive compounds were used to build ~3000 models for each run. We used 75% of the molecules for model construction and fitting, while the remaining 25% constituted the test set. The data were partitioned by random selection with a fixed seed, ensuring that the same split is replicated consistently and facilitating the validation and comparison of models in future work. The models were optimized with Optuna, and the best model was selected with a Pareto optimization approach[3], which allowed us to simultaneously minimize the discrepancy between training and validation results and maximize performance on the precision evaluation metric. Figure 1A shows the Pareto chart with the best model selected after optimization. The XGBoost parameters of the best model in this case are: alpha = 1.0545, colsample_bytree = 0.5502, gamma = 0.1833, lambda = 1.8003, learning_rate = 0.0579, max_depth = 10, min_child_weight = 1, n_estimators = 632, reg_alpha = 0.5484, reg_lambda = 0.3758, scale_pos_weight = 1.118, and subsample = 0.8801. To illustrate the diagnostic ability of the BACE1 ML model evaluated with the precision metric, we plotted sensitivity (true positive rate) against 1 - specificity (false positive rate) in a receiver operating characteristic (ROC) curve and calculated the area under the curve (AUC; values between 0 and 1, the higher the better) for both the train and test sets (Figure 1B). The AUC is the probability that the ML model ranks a randomly chosen positive instance higher than a randomly chosen negative one. In addition, to assess whether the model is reliable, the calibration curve (or reliability diagram) was plotted (Figure 1C). This curve shows how well the probabilistic predictions of the binary ML model are calibrated for both the train and test sets; calibration curves that lie closer to the diagonal indicate a better-calibrated model.
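The split and optimization step can be sketched roughly as follows. The variable names (X, y), search ranges, and internal train/validation split are illustrative assumptions, not the exact settings used to produce the published models.

```python
# Hedged sketch of the 75/25 split and Optuna multi-objective (Pareto) optimization.
# X, y stand for the descriptor matrix and active/inactive labels of one target (e.g. BACE1).
import optuna
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score

X_model, X_test, y_model, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)       # fixed seed for reproducibility
X_fit, X_val, y_fit, y_val = train_test_split(
    X_model, y_model, test_size=0.2, random_state=42, stratify=y_model)

def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "gamma": trial.suggest_float("gamma", 0.0, 1.0),
    }
    model = xgb.XGBClassifier(**params, n_jobs=-1)
    model.fit(X_fit, y_fit)
    prec_train = precision_score(y_fit, model.predict(X_fit))
    prec_val = precision_score(y_val, model.predict(X_val))
    # Two objectives: maximize validation precision, minimize the train/validation gap
    return prec_val, abs(prec_train - prec_val)

study = optuna.create_study(directions=["maximize", "minimize"])
study.optimize(objective, n_trials=3000)   # ~3000 models per run, as described above
pareto_front = study.best_trials           # non-dominated trials (Pareto front)
```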

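Under the same assumptions, the ROC and calibration diagnostics of Figure 1B-C can be reproduced with standard scikit-learn utilities, where `model` denotes a classifier refit with the selected Pareto-optimal parameters:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.calibration import calibration_curve

proba_test = model.predict_proba(X_test)[:, 1]    # predicted probability of being active

# ROC curve: sensitivity (TPR) against 1 - specificity (FPR), plus the AUC (Figure 1B)
fpr, tpr, _ = roc_curve(y_test, proba_test)
print("test AUC:", roc_auc_score(y_test, proba_test))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], "k--")                   # random-classifier reference line

# Calibration (reliability) curve: fraction of actives vs. mean predicted probability (Figure 1C)
frac_pos, mean_pred = calibration_curve(y_test, proba_test, n_bins=10)
plt.figure()
plt.plot(mean_pred, frac_pos, marker="o")
plt.plot([0, 1], [0, 1], "k--")                   # perfectly calibrated model
plt.show()
```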
Afterwards, to test the BACE1-ML model we used the remaining 25% of the BACE1 active/inactive molecules and calculated the confusion matrix. This predictive analysis tool provides a structured comparison between the true values and the values predicted by the model, and it can be employed to evaluate the different metrics used in the models. Within the matrix, four key variables characterize the classification process: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) (Figure 2A). TP is positive data classified as positive, TN is negative data classified as negative, FP is negative data classified as positive, and FN is positive data classified as negative. Figure 2B illustrates the normalized confusion matrix derived from the BACE1-ML model. The data indicate that 83% of inactive compounds are correctly identified as such, while 94% of active compounds are correctly classified as active. Conversely, only 17% of inactive compounds are incorrectly classified as active, and a minimal 0.55% of active compounds are erroneously classified as inactive. This matrix provides a quantitative overview of the model's classification performance, highlighting its ability to correctly identify both positive and negative instances, as well as its misclassification rates.
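A normalized confusion matrix of the kind shown in Figure 2B can be computed with scikit-learn as sketched below, again assuming the `model`, `X_test`, and `y_test` objects from the sketches above:

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = model.predict(X_test)

# Rows are true classes, columns predicted classes; normalize="true" turns counts into per-class rates
cm = confusion_matrix(y_test, y_pred, normalize="true")
print(cm)   # [[TN rate, FP rate], [FN rate, TP rate]] with 0 = inactive, 1 = active

ConfusionMatrixDisplay(cm, display_labels=["inactive", "active"]).plot()
```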
To complement this information, Figure 2C shows a bar plot divided into two distinct zones: inactive and active. This representation demonstrates that decoys and inactive compounds are predominantly located in the inactive zone (as expected), in agreement with the confusion matrix, while most active compounds are situated within the active zone. Although some decoys and inactive compounds are classified as active, such occurrences are minimal. This visualization offers a clear picture of how the model distributes compounds between the active and inactive categories, corroborating the insights derived from the confusion matrix analysis.
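A plot in the spirit of Figure 2C (a hedged approximation, not the exact figure code) can be produced by stacking the predicted scores of the two true classes on either side of the decision threshold:

```python
import numpy as np
import matplotlib.pyplot as plt

proba = model.predict_proba(X_test)[:, 1]     # predicted probability of being active
y_arr = np.asarray(y_test)
bins = np.linspace(0, 1, 21)

plt.hist(proba[y_arr == 0], bins=bins, alpha=0.6, label="decoys / inactive")
plt.hist(proba[y_arr == 1], bins=bins, alpha=0.6, label="active")
plt.axvline(0.5, color="k", linestyle="--")   # boundary between the inactive and active zones
plt.xlabel("predicted probability of being active")
plt.ylabel("number of compounds")
plt.legend()
plt.show()
```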
To access all the data derived from the construction and validation of the models, see the details in the list on the right panel.

References
- Tarwidi, D.; Pudjaprasetya, S. R.; Adytia, D.; Apri, M. An Optimized XGBoost-Based Machine Learning Method for Predicting Wave Run-up on a Sloping Beach. MethodsX 2023, 10, 102119.
- He, H.; Garcia, E. A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21 (9), 1263–1284.
- Chen, Y.-S.; Chong, P. P.; Tong, M. Y. Mathematical and Computer Modelling of the Pareto Principle. Math. Comput. Model. 1994, 19 (9), 61–80.