Statistical potential

cosmos 8th November 2016 at 2:02pm

As used in Protein structure analysis, statistical potentials score decoys by comparing their features to experimentally determined structures, based on the assumption that the observed distributions of particular features reflect energetics: i.e., a commonly observed feature is assumed to be energetically favourable.

RAPDF (residue-specific all-atom probability discriminatory function)

Essentially a Naive Bayes classifier trained on a set $C$ of native structures (structures observed experimentally and assumed to be correct).

We wish to evaluate $P(C|\{d_{ij}^{ab}\})$, the probability that the structure is a member of the "correct" set $C$, given that it contains the distances $\{d_{ij}^{ab}\}$. We write this probability, using Bayes' theorem, as

$$P(C|\{d_{ij}^{ab}\}) = P(C) \frac{P(\{d_{ij}^{ab}\}|C)}{P(\{d_{ij}^{ab}\})}$$

We then make the assumption that $P(\{d_{ij}^{ab}\}|C)$ factorizes as $\prod\limits_{ij}P(d_{ij}^{ab}|C)$ (i.e. the Naive Bayes assumption!).

The score of each decoy (with features $\{d_{ij}^{ab}\}$) is then just the negative log-likelihood

$$S(\{d_{ij}^{ab}\}) = -\ln{P(C|\{d_{ij}^{ab}\})} \simeq -\sum_{ij} \ln{\left(\frac{P(d_{ij}^{ab}|C)}{P(d_{ij}^{ab})}\right)}$$

(up to an additive constant that is the same for every decoy)

where $S$ is the score for the decoy, and $d_{ij}^{ab}$ is the distance between atoms $i$ and $j$, of types $a$ and $b$.
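A sketch of the steps connecting Bayes' theorem to the score. Two things here are assumptions not spelled out in this note: the denominator $P(\{d_{ij}^{ab}\})$ is taken to factorize over atom pairs in the same way as the numerator, and the prior $P(C)$ is treated as a constant that does not depend on the decoy, so it can be dropped when ranking decoys.

$$
\begin{aligned}
-\ln{P(C|\{d_{ij}^{ab}\})} &= -\ln{P(C)} - \ln{\frac{P(\{d_{ij}^{ab}\}|C)}{P(\{d_{ij}^{ab}\})}} \\
&\approx -\ln{P(C)} - \ln{\prod_{ij} \frac{P(d_{ij}^{ab}|C)}{P(d_{ij}^{ab})}} \\
&= -\ln{P(C)} - \sum_{ij} \ln{\left(\frac{P(d_{ij}^{ab}|C)}{P(d_{ij}^{ab})}\right)}
\end{aligned}
$$

With this sign convention, a lower score means a more native-like decoy.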

See more explanation here

The probabilities are estimated (as in Naive Bayes) as sample frequencies:

$$P(d_{ij}^{ab}|C) = \frac{N(d_{ij}^{ab})}{\sum_d N(d_{ij}^{ab})}$$

where the distances $d_{ij}^{ab}$ have been discretized into bins, and $N(d_{ij}^{ab})$ is the number of occurrences of a distance in bin $d$ between atoms of types $a$ and $b$, counted over all native configurations in $C$.
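The frequency estimate above can be sketched in a few lines of Python. This is only an illustration under assumptions: the 1 Å bin width, the atom-type labels such as `ALA_CB`, and the function names are all made up for the example, and the order of the two types is treated as irrelevant.

```python
from collections import Counter, defaultdict

BIN_WIDTH = 1.0  # assumed bin width in Angstrom (hypothetical choice)


def distance_bin(distance, bin_width=BIN_WIDTH):
    """Discretize a distance into an integer bin index."""
    return int(distance // bin_width)


def conditional_frequencies(contacts, bin_width=BIN_WIDTH):
    """Estimate P(d | C) per type pair as N(d_ab) / sum_d N(d_ab).

    `contacts` is an iterable of (type_a, type_b, distance) tuples,
    pooled over all native structures in C.
    """
    counts = defaultdict(Counter)
    for type_a, type_b, distance in contacts:
        # Assumption: (a, b) and (b, a) are counted as the same pair.
        pair = tuple(sorted((type_a, type_b)))
        counts[pair][distance_bin(distance, bin_width)] += 1

    probabilities = {}
    for pair, histogram in counts.items():
        total = sum(histogram.values())
        probabilities[pair] = {b: n / total for b, n in histogram.items()}
    return probabilities


# Toy observations standing in for distances measured in native structures.
native_contacts = [
    ("ALA_CB", "LEU_CD1", 3.8),
    ("LEU_CD1", "ALA_CB", 4.2),
    ("ALA_CB", "LEU_CD1", 4.9),
    ("ALA_CB", "LEU_CD1", 7.5),
    ("GLY_N", "GLY_O", 2.9),
]
p_given_c = conditional_frequencies(native_contacts)
```

Each per-pair histogram sums to one by construction. With real data, bins that were never observed would need some kind of smoothing, which this sketch leaves out.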

The average over all experimental structures is

$$P(d_{ij}^{ab}) = \frac{N(d_{ij}^{ab})}{\sum_d N(d_{ij}^{ab})}$$ counted over all structures, not just those in $C$.

However, they further approximate this. They assume that it is independent of $a$ and $b$ (I guess because, over all structures, there is more randomness), and estimate it as the average over all $a$ and $b$ and over all structures in $C$ (I guess because we don't have the whole set of possible structures):

$$P(d_{ij}^{ab}) \approx P(d) = \frac{\sum_{ab}N(d_{ij}^{ab})}{\sum_d \sum_{ab} N(d_{ij}^{ab})}$$ counted over structures in $C$.
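Putting the pieces together, here is a minimal sketch of the pooled reference state and the resulting decoy score. The counts are toy numbers, the names `rapdf_tables` and `score` are hypothetical, and the sign is chosen so that the score is a negative log-likelihood ratio (lower means more native-like). Skipping pairs or bins that were never observed is an assumption made for brevity, not something taken from the method.

```python
import math


def rapdf_tables(counts):
    """From counts[pair][bin] = N(d_ab), build P(d | C) per pair and pooled P(d)."""
    # Reference state: pool the counts over all type pairs (a, b).
    pooled = {}
    for histogram in counts.values():
        for b, n in histogram.items():
            pooled[b] = pooled.get(b, 0) + n
    grand_total = sum(pooled.values())
    p_ref = {b: n / grand_total for b, n in pooled.items()}

    # Pair-specific frequencies, normalized over distance bins.
    p_cond = {
        pair: {b: n / sum(histogram.values()) for b, n in histogram.items()}
        for pair, histogram in counts.items()
    }
    return p_cond, p_ref


def score(decoy_contacts, p_cond, p_ref):
    """S = -sum over contacts of ln(P(d | C) / P(d)).

    `decoy_contacts` is a list of (pair, bin) tuples for one decoy.
    """
    total = 0.0
    for pair, b in decoy_contacts:
        p = p_cond.get(pair, {}).get(b)
        if p is None or b not in p_ref:
            continue  # assumption: unseen pair/bin contributes nothing
        total -= math.log(p / p_ref[b])
    return total


# Toy counts: A-B pairs prefer bin 3, A-A pairs prefer bin 4.
counts = {
    ("A", "B"): {3: 6, 4: 2},
    ("A", "A"): {3: 2, 4: 6},
}
p_cond, p_ref = rapdf_tables(counts)

native_like = [(("A", "B"), 3), (("A", "A"), 4)]
unlikely = [(("A", "B"), 4), (("A", "A"), 3)]
print(score(native_like, p_cond, p_ref), score(unlikely, p_cond, p_ref))
```

In this toy case the pooled reference is flat (each bin has probability 0.5), so the decoy whose distances fall in the bins favoured by its type pairs gets a negative score and the other gets a positive one.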