README for Top500 angle data
(Ramachandran & rotamers)


What's in this distribution

What's NOT in this distribution


About the data

The data used for both the Ramachandran and rotamer studies is drawn from our Top500 database, a selection of 500 files from the Protein Data Bank that are high resolution (1.8 Å or better), low homology, and high quality (see http://kinemage.biochem.duke.edu/databases/top500.php for details).

This gives us more than 100,000 residues to analyze. To remove some of the noisiest data, residues with high B factors are discarded. For the Ramachandran plots, only residues where B < 30 for all mainchain atoms are considered. For the rotamer plots, only residues where B < 40 for all atoms are considered. As described in The Penultimate Rotamer Library (2000), this measure is found to significantly sharpen the distribution. Unlike that analysis, however, data is not filtered on the basis of all-atom contacts.
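The B-factor cutoffs above amount to a simple per-residue filter. The sketch below is only an illustration, not the code used to build this distribution: the Atom record and the passes_b_filter helper are hypothetical names, and treating N, CA, C, and O as the mainchain atoms is an assumption.

```python
from dataclasses import dataclass

# Assumption: these four atoms are treated as "mainchain" for the filter.
MAINCHAIN = {"N", "CA", "C", "O"}

@dataclass
class Atom:
    name: str        # PDB atom name, e.g. "CA"
    b_factor: float  # isotropic B factor

def passes_b_filter(atoms, cutoff, mainchain_only=False):
    """Return True if every considered atom has B strictly below cutoff."""
    considered = [a for a in atoms if not mainchain_only or a.name in MAINCHAIN]
    return all(a.b_factor < cutoff for a in considered)

# A made-up residue with a well-ordered backbone and a mobile side chain.
residue = [Atom("N", 12.0), Atom("CA", 14.5), Atom("C", 13.0),
           Atom("O", 16.0), Atom("CB", 35.0)]

keep_for_rama = passes_b_filter(residue, 30.0, mainchain_only=True)  # Ramachandran: B < 30
keep_for_rota = passes_b_filter(residue, 40.0)                       # rotamers: B < 40
```

With these made-up values the residue passes both filters, but it would fail a B < 30 test applied to all atoms because of its CB atom.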

Because glycine is a symmetrical molecule, the local physical constraints on its phi-psi preferences should also be symmetrical. However, natural selection favors glycines only when they are necessary, and so there is a frequency bias towards L-alpha in the natural distribution. To generate our density traces, we symmetrize Gly -- for each sample at (phi, psi) we generate a duplicate at (-phi, -psi). The kin/ folder shows the natural (asymmetric) distribution of points overlaid on the symmetrized contours. The density data in the pct/ and stat/ folders reflects the symmetrized data.
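The symmetrization step is small enough to show directly. This is a minimal sketch, assuming samples are (phi, psi) pairs in degrees; symmetrize_gly is a hypothetical helper name, and the sample values are made up.

```python
def symmetrize_gly(samples):
    """Return the samples plus a mirror image at (-phi, -psi) for each one."""
    mirrored = [(-phi, -psi) for phi, psi in samples]
    return list(samples) + mirrored

gly = [(-60.0, -45.0), (80.0, 10.0)]   # made-up (phi, psi) samples
sym = symmetrize_gly(gly)              # twice as many points, symmetric about the origin
```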

As described in The Penultimate Rotamer Library, previous rotamer libraries have included some physically impossible "decoy" rotamers for leucine that fill roughly the same space as real leucine rotamers. We excise Leu decoy rotamers tt* and mp* by discarding any data points that fall within a 34°-radius circle around (214.3, 215.6) or within a 40°-radius circle around (253, 10.4).
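The decoy excision can be sketched as a distance test in (chi1, chi2) space. The centers and radii come from the paragraph above; everything else is an assumption made for illustration, in particular that angular differences wrap around at 360° and that the circle boundary itself is not excluded. The helper names are hypothetical.

```python
import math

# (center_chi1, center_chi2, radius), all in degrees, taken from the text above.
LEU_DECOYS = [(214.3, 215.6, 34.0), (253.0, 10.4, 40.0)]

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def is_leu_decoy(chi1, chi2):
    """True if (chi1, chi2) lies inside any of the decoy circles."""
    return any(
        math.hypot(angle_diff(chi1, c1), angle_diff(chi2, c2)) < r
        for c1, c2, r in LEU_DECOYS
    )

leu = [(214.0, 216.0), (300.0, 175.0), (250.0, 355.0)]   # made-up samples
kept = [p for p in leu if not is_leu_decoy(*p)]
```

The third sample shows why wrapping matters: chi2 = 355° is only 15.4° away from 10.4°, so that point falls inside the second circle.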

After inspecting the distributions for phenylalanine and tyrosine, we conclude that there is no observable difference between them. Therefore, in order to improve the quality of the data, rotamer data for Phe and Tyr are merged.

Things left undone

There are some issues that have not yet been addressed in this analysis. For cis prolines, the allowed rotamers and Ramachandran values are a subset of those for the trans distribution, but here they are not treated separately. This data analysis also makes no allowance for the secondary structure dependence of rotamers. This issue was addressed only for location and frequency of central rotamer values in The Penultimate Rotamer Library (with separately defined helix, sheet, and plus-phi rotamer locations for Asp and Asn, and separate occurrence frequencies for rotamers of other residues). We plan a treatment of this problem for the full distributions in the near future.

Nomenclature changes

There has been an important change in the way that we describe the data since publication of the Ramachandran paper. In that paper, we reported contour levels by the percentage of data points they include; for example, the favored/allowed boundary for Ramachandran regions was defined to be the 98% contour. This has the counterintuitive side effect that high numbers describe areas of low data density, and vice versa.

We are now classifying density levels by the fraction of data points excluded (that is, the fraction of data points lying in areas of lower density than the area under consideration). Therefore, low numbers represent low density, and high numbers represent high density. The scale also now ranges from 0.0 to 1.0, rather than from 100% to 0%, so the favored/allowed Ramachandran boundary is now set at 0.02. The new scheme is used to describe data in the pct/ folder.

Another important change relates to the way the smoothing functions are specified. In the paper, we give the maximum radius of the cosine function, the distance at which it falls to zero. The software now uses the half-width (i.e., radius) at half-height. This is consistent with earlier smoothing software we produced, and more importantly, allows one specification to refer to roughly equivalent Gaussian and cosine distribution functions. (Because a Gaussian only approaches zero, no maximum radius can be defined for it.)
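The relationship between the two specifications can be written out as follows. This is a sketch under the assumption that the cosine kernel has the raised-cosine form 0.5 * (1 + cos(pi * r / R)); the function names are hypothetical and are not taken from the distributed software.

```python
import math

def cosine_max_radius(half_width):
    """Radius at which the raised cosine reaches zero, given its half-width at half-height."""
    return 2.0 * half_width

def gaussian_sigma(half_width):
    """Standard deviation of a Gaussian with the given half-width at half-height."""
    return half_width / math.sqrt(2.0 * math.log(2.0))

def cosine_kernel(r, half_width):
    """Unnormalized raised cosine: 1 at r = 0, 0.5 at r = half_width, 0 from 2 * half_width on."""
    radius = cosine_max_radius(half_width)
    if abs(r) >= radius:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * r / radius))

def gaussian_kernel(r, half_width):
    """Unnormalized Gaussian: 1 at r = 0, 0.5 at r = half_width, never exactly 0."""
    sigma = gaussian_sigma(half_width)
    return math.exp(-r * r / (2.0 * sigma * sigma))
```

Under that assumption, a half-width of 5° corresponds to a cosine with a maximum radius of 10°, and both kernels pass through half their peak height at 5°.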

Gaussians vs. cosines

In analyzing these discrete data points, we build up a function that represents the density of data points in each small local region [specified as a vector of angles -- (phi, psi) or (chi1, chi2, ...)] -- a probability distribution. We do that by representing each data point as a Gaussian-like function and summing those up to get the overall distribution. The result is a density trace, something like a histogram but without some of the histogram's limitations. (See our Structural Validation paper for a full description.)

Strictly speaking, we don't use a Gaussian (something like exp[-x^2]) -- we use one period of a cosine function, from -pi to +pi. The two curves have a similar shape and produce almost the same result. You can regenerate the data using Gaussian smoothing instead of cosines by changing the -cosine=# switches in the Makefile to -gaussian=# switches.
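To make the idea of a density trace concrete, here is a one-dimensional, fixed-width sketch that sums one cosine bump per data point on a periodic angle axis. It is deliberately simplified: it uses a single pass with a constant half-width, assumes the raised-cosine form 0.5 * (1 + cos(pi * d / R)), and uses hypothetical function names and made-up sample values.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def density_trace(points, grid, half_width):
    """Sum one raised-cosine bump per data point at each grid angle (degrees)."""
    radius = 2.0 * half_width   # distance at which the cosine falls to zero
    trace = []
    for g in grid:
        total = 0.0
        for p in points:
            d = angle_diff(g, p)
            if d < radius:
                # Dividing by radius gives each bump unit area in one dimension.
                total += 0.5 * (1.0 + math.cos(math.pi * d / radius)) / radius
        trace.append(total)
    return trace

grid = [float(i) for i in range(360)]    # one sample per degree
points = [60.0, 62.0, 65.0, 180.0]       # made-up angles
trace = density_trace(points, grid, half_width=5.0)
```

Because every bump has unit area, the trace integrates to the number of data points (here, 4), and the three nearby points reinforce each other into a single smooth peak instead of landing in separate histogram bins.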

However, the choice of smoothing function has far less impact on the outcome than does the use of our density-dependent smoothing algorithm. The problem with the traditional, one-pass, Gaussian-smoothing analysis is that it blurs out the boundaries of the Ramachandran plot. Some regions, like the shallow "beach" in the lower-left of the general plot, have very sparse populations and soft boundaries. Other regions, like alpha helix, are "cliffs" that have very high populations (many orders of magnitude above the other regions) with very hard boundaries (the population falls to zero just a few degrees to the right). The traditional analysis is incapable of treating both in a way that gives physically realistic results -- either the beach is left too lumpy or the cliff is smeared out.

It is for this reason that we developed the density-dependent smoothing algorithm, which smooths the dense regions less and the sparse regions more. In this application the cosine has some advantages over the Gaussian, because it falls to zero at a finite distance. Thus, it can be computed without truncation, so its volume really sums to 1. Also, the Gaussian must be evaluated further out [we use 4.5 halfwidths as the limit (where the value falls to ~ 1e-6 of its maximum), rather than just 2 for the cosine] in order to get a good approximation, which means it can take substantially longer to compute, particularly for higher-dimensional data spaces. Finally, because its tails actually go to zero, the cosine is less prone to smearing out the cliffs than the Gaussian is. The suggestion for constructing a density trace using cosines was taken from the NCSS statistical analysis software package; see http://exploringdata.cqu.edu.au/den_trac.htm.
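The figures quoted above can be checked with a few lines of arithmetic. The sketch below evaluates a Gaussian, expressed in units of its half-width at half-height, at the 4.5-halfwidth cutoff. It also estimates how much more of the grid each data point touches than with a cosine cut off at 2 halfwidths. The cost model (work proportional to the volume of the kernel's support) is an assumption made for illustration, and the function names are hypothetical.

```python
import math

def gaussian_at(n_half_widths):
    """Gaussian value relative to its peak, at a distance given in half-widths at half-height."""
    return math.exp(-math.log(2.0) * n_half_widths ** 2)   # same as 0.5 ** (n ** 2)

def support_ratio(dims, gauss_cutoff=4.5, cosine_cutoff=2.0):
    """Ratio of grid volume touched per data point, Gaussian vs. cosine (assumed cost model)."""
    return (gauss_cutoff / cosine_cutoff) ** dims

tail = gaussian_at(4.5)                              # on the order of 1e-6 of the peak
ratios = {n: support_ratio(n) for n in (2, 3, 4)}    # grows quickly with dimension
```

Under this cost model the Gaussian touches about 5 times as many grid samples per data point in two dimensions and about 26 times as many in four.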

As far as we know, the density-dependent smoothing algorithm is completely novel; we could find no existing statistical technique for treating this type of problem. Our approach attempts to better represent what we believe is the true underlying structure of the (noisy) data. Thus, this analysis is almost like image processing, in which one filters and manipulates a noisy photograph in an attempt to extract a clearer image of the original subject. The resulting image is quite different from the original photo, but (hopefully) is a better representation of reality than the original data was.

Data format

All the data in the stat/, pct/, and kin/ folders is stored in plain text (ASCII) formats with Unix-style (\n) linebreaks.

The stat/ folder

This folder holds the raw density traces for the Ramachandran and rotamer data, where each sample is taken directly from the probability distribution that we calculate in two passes as a sum of cosines. The values have been normalized such that each point contributes a single unit of area/volume/hyper-volume to the density trace (e.g., for the Ramachandran plot, each point contributes 1.0 cubic degrees to the volume enclosed by the distribution).

Each .data file begins with a series of comments describing the size and spacing of the grid of samples. Samples are written one per line, with the full coordinates (e.g., phi and psi; or chi1, chi2, ...) followed by the density value. This format means the data can be fed into the kinNDcont programs without further manipulation.
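A reader for this layout might look like the sketch below. Several details are assumptions, since they are not specified above: that comment lines start with "#", that fields are separated by whitespace, and that the density is the last field on each line. The function name and the example lines are invented for illustration and are not taken from the real files.

```python
def read_density_file(lines, comment_prefix="#"):
    """Split an iterable of text lines into header comments and (coords, density) samples."""
    comments, samples = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(comment_prefix):   # assumed comment marker
            comments.append(line)
            continue
        # All fields but the last are coordinates; the last is the density value.
        *coords, density = (float(tok) for tok in line.split())
        samples.append((tuple(coords), density))
    return comments, samples

# Invented example content, shaped like a two-dimensional (phi, psi) file.
example = [
    "# hypothetical header line",
    "-179.0 -179.0 0.0012",
    "-177.0 -179.0 0.0015",
]
comments, samples = read_density_file(example)
```

An open file object can be passed in place of the list, since the function only iterates over lines. The same code handles rotamer files with more coordinates per line.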

This data would be appropriate for statistical applications, such as predicting the energy difference between two conformational states. Normalizing the data in a way that is appropriate to the application at hand is left to the user.
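One common way to turn a density ratio into an energy estimate is Boltzmann inversion, sketched below. This is an assumption about how a user might apply the data, not a method prescribed by this distribution; the function name, the default temperature, and the choice of kcal/mol units are illustrative.

```python
import math

K_B = 0.0019872041   # Boltzmann (gas) constant in kcal/(mol*K)

def energy_difference(density_a, density_b, temperature=298.15):
    """Estimate E_a - E_b in kcal/mol, assuming densities follow Boltzmann statistics."""
    return -K_B * temperature * math.log(density_a / density_b)

# A state ten times more populated comes out about 1.36 kcal/mol lower in energy.
delta = energy_difference(10.0, 1.0)
```

Only the ratio of the two densities matters here, so any overall normalization of the density trace cancels out.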

Note that the Ramachandran plots are heavily biased by inter-residue interactions -- secondary structure. For this reason, alpha helix and beta sheet conformations are greatly over-represented relative to their individual energies. You may find it more helpful to work with the data labeled "nosec," which has all residues in repetitive secondary structure removed.

The pct/ folder

This folder holds density traces that have been converted to fraction-excluded. Using the data from stat/, we calculate for each sample what fraction of data points occur at lower density than that sample, and then output that fraction. This data is appropriate for lookup tables that determine whether, e.g., a given residue is in the favored (>0.02), allowed, or outlier (<0.002; <0.0005 for general case) region of the Ramachandran plot. Again, the files are self-describing.
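The conversion and lookup described above can be sketched as follows. The cutoffs are the ones given in the text; the function names are hypothetical, and the sketch assumes the density at each original data point is already available as a sorted list.

```python
from bisect import bisect_left

def fraction_excluded(sample_density, sorted_point_densities):
    """Fraction of data points lying at strictly lower density than the sample."""
    n_lower = bisect_left(sorted_point_densities, sample_density)
    return n_lower / len(sorted_point_densities)

def classify(frac, outlier_cutoff=0.002, favored_cutoff=0.02):
    """Map a fraction-excluded value to a Ramachandran region label."""
    if frac > favored_cutoff:
        return "favored"
    if frac < outlier_cutoff:
        return "outlier"
    return "allowed"

point_densities = sorted([0.1, 0.2, 0.3, 0.4])      # made-up densities at the data points
frac = fraction_excluded(0.25, point_densities)     # two of the four points are lower
label = classify(frac)
general_label = classify(0.001, outlier_cutoff=0.0005)   # general-case outlier cutoff
```

Passing outlier_cutoff=0.0005 applies the general-case threshold, under which a value of 0.001 counts as allowed and not as an outlier.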

The kin/ folder

This folder holds kinemage illustrations of the data from pct/. View them with Mage or KiNG (downloadable from http://kinemage.biochem.duke.edu).
Last updated 23 Apr 2003 by Ian W. Davis