Risky non-orthogonal features

With the widespread use of machine learning in binding site prediction, many different features have been proposed. However, the relationship between a feature and the prediction is loose, because the learning process is complicated. Further, many features that are derived in similar ways are not orthogonal to each other and therefore cannot represent distinct aspects of protein-nucleic acid binding. Non-orthogonal features distort the metric structure of the data space and carry redundant information. As a result, some of these features are not effective enough, as demonstrated in the papers. For instance, some prediction approaches map the sequence to AA index values, so that each residue is expressed as a feature vector of those values. This is easy to implement in a machine learning pipeline, but the 20 amino acid types yield only 20 distinct vectors for the learning machine. If the learning machine could always reach the optimal model, such a mapping would make no difference compared with taking the residue type directly as input.
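The point can be sketched in a few lines of Python. The per-residue scales below are invented for illustration (they are not real AAindex entries): mapping residues to fixed per-residue values is an injective function of residue type, so it can produce at most 20 distinct vectors and carries no information beyond the residue identity itself.

```python
# Two illustrative per-residue scales (values are made up for this sketch,
# not real AAindex entries).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
hydrophobicity = {aa: 0.1 * i for i, aa in enumerate(AMINO_ACIDS)}
volume = {aa: 50 + 5 * i for i, aa in enumerate(AMINO_ACIDS)}

def residue_features(sequence):
    """Map each residue to its fixed feature vector."""
    return [(hydrophobicity[aa], volume[aa]) for aa in sequence]

# No matter how long the sequence, only 20 distinct vectors can occur,
# and each vector identifies exactly one residue type (the mapping is
# invertible), so no new information is added.
vectors = residue_features("MKVLAAGMKV")
all_vectors = residue_features(AMINO_ACIDS)
inverse = {v: aa for aa, v in zip(AMINO_ACIDS, all_vectors)}

print(len(set(all_vectors)))  # at most 20 distinct vectors
```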

Scheme illustrating the redundancy of a 'new feature' obtained by simply mapping residues to feature values.

The learning process together with the mapping process can be regarded as one large learning machine. If such a method is assessed only by cross-validation with a small-scale independent test, the conclusions can be misleading. It is therefore better to confirm, before use, that the features are orthogonal and derived in different ways. For example, electrostatic potential can be derived from structure-based calculation, while evolutionary information is derived from homologous sequence search and alignment.
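One simple way to screen for the redundancy described above is to measure the correlation between candidate feature columns before combining them. The sketch below uses hypothetical feature values (not taken from any real predictor): a Pearson correlation near |1| signals that one feature is largely a rescaling of another and contributes no independent information.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-residue features: 'derived' is a linear rescaling of
# 'base', so it is fully redundant with it; 'independent' comes from an
# unrelated (made-up) scale.
base = [0.2, 0.5, 0.9, 0.1, 0.7, 0.4]
derived = [2 * v + 1 for v in base]
independent = [0.3, 0.1, 0.8, 0.6, 0.2, 0.9]

print(round(pearson(base, derived), 3))      # 1.0: fully redundant
print(round(pearson(base, independent), 3))  # far from +/-1
```

A correlation check of this kind is only a linear screen; features derived by genuinely different routes (e.g. structure-based electrostatics versus alignment-based conservation) are more likely to pass it.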