README for Top500 angle data
(Ramachandran & rotamers)
What's in this distribution
- README.html - this file (supplemental documentation and
notes for posterity ;)
- howto/ - additional
documentation on how to use this data for specific tasks
- StructValid.pdf - the 2003
paper by Lovell, et al. that fully describes the Ramachandran
data, including the density-dependent smoothing
method. The rotamer data has not yet been published (as of April 2003),
but the information on materials & methods is equally applicable.
Please read the paper carefully before trying to use this data in your
own research! [details]
- PenultRotLib.pdf - the
2000 paper by Lovell, et al. that describes an earlier analysis
of sidechain rotamers. It focuses on identifying and naming true
rotamers (i.e., energy minima) and so is complementary to this
data, which defines a probability distribution over all possible
conformations.
- Makefile - a file that will re-create stat/, pct/,
and kin/ from the data in srcdata/
- lib/ - Java software used to smooth the input data (Silk)
- scriptbin/ - a collection of AWK scripts that were used
to prepare the input data
- srcdata/ - quality-filtered input data that was extracted
from the Top500 database
- stat/ - raw density traces that can be used for
statistical purposes, like Boltzmann energy potentials [details]
- pct/ - density traces that have been converted so as to
be useful in determining whether given conformations are allowed or
outliers [details]
- kin/ - kinemage format illustrations for exploring the
data interactively [details]
What's NOT in this distribution
About the data
The data used for both the Ramachandran and rotamer studies is drawn
from our Top500 database, a selection of 500 files from the Protein Data Bank that are high
resolution (1.8 Å or better), low homology, and high quality (see
http://kinemage.biochem.duke.edu/databases/top500.php
for details).
This gives us more than 100,000 residues to analyze. To remove some of
the noisiest data, residues with high B factors are discarded.
For the Ramachandran plots, only residues where B < 30 for all
mainchain atoms are considered. For the rotamer plots, only residues
where B < 40 for all atoms are considered. As described in The Penultimate Rotamer Library (2000),
this measure is found to significantly sharpen the distribution. Unlike
that analysis, however, data is not filtered on the basis of
all-atom contacts.
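The B-factor filter described above can be sketched in a few lines of Python. This is only an illustration, not the AWK scripts shipped in scriptbin/; the function names, the dictionary layout, and the choice of N, CA, C, and O as the "mainchain atoms" are assumptions made for the example.

```python
# Cutoffs taken from the text above.
RAMA_B_CUTOFF = 30.0     # applies to mainchain atoms only
ROTAMER_B_CUTOFF = 40.0  # applies to every atom in the residue

# Assumption: which atom names count as mainchain.
MAINCHAIN = {"N", "CA", "C", "O"}


def passes_rama_filter(atom_b):
    """atom_b maps atom name -> B factor for a single residue."""
    return all(b < RAMA_B_CUTOFF
               for name, b in atom_b.items() if name in MAINCHAIN)


def passes_rotamer_filter(atom_b):
    """True only if every atom in the residue has B below the cutoff."""
    return all(b < ROTAMER_B_CUTOFF for b in atom_b.values())


# Hypothetical residue: well-ordered backbone, disordered sidechain tip.
residue = {"N": 12.0, "CA": 15.0, "C": 14.0, "O": 20.0, "CB": 35.0, "CG": 45.0}
keep_for_rama = passes_rama_filter(residue)
keep_for_rotamers = passes_rotamer_filter(residue)
```

A residue like the one above would be kept for the Ramachandran data but dropped from the rotamer data.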
Because glycine is a symmetrical molecule, the local physical
constraints on its phi-psi preferences should also be symmetrical.
However, natural selection favors glycines only when they are
necessary, and so there is a frequency bias towards L-alpha in the
natural distribution. To generate our density traces, we symmetrize
Gly -- for each sample at (phi, psi) we generate a duplicate at
(-phi, -psi). The kin/ folder shows the natural (asymmetric)
distribution of points overlaid on the symmetrized contours. The
density
data in the pct/ and stat/ folders reflects the
symmetrized data.
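As a minimal sketch of the symmetrization step (the function name is hypothetical, and this is not the code used to build the distribution), each glycine sample is simply emitted twice:

```python
def symmetrize_gly(points):
    """Return every (phi, psi) sample plus its mirror image (-phi, -psi)."""
    out = []
    for phi, psi in points:
        out.append((phi, psi))
        out.append((-phi, -psi))  # duplicate reflected through the origin
    return out


doubled = symmetrize_gly([(-60.0, -45.0), (80.0, 10.0)])
```

Note that the symmetrized set has twice as many samples as the natural one, which matters if the counts are later used for normalization.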
As described in The Penultimate Rotamer
Library, previous rotamer libraries have included some physically
impossible "decoy" rotamers for leucine that fill roughly the same
space as real leucine rotamers. We excise Leu decoy rotamers
tt*
and mp* by discarding any data points that fall within a 34°-radius
circle around (214.3, 215.6) or within a 40°-radius circle around
(253, 10.4).
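The excision can be sketched as a simple distance test in (chi1, chi2) space. The centers and radii come from the text above; measuring distance with angle wrap-around, and treating points exactly on the circle as kept, are assumptions made for this example.

```python
import math

# ((chi1, chi2) center, radius) in degrees, from the text above.
LEU_DECOYS = [((214.3, 215.6), 34.0), ((253.0, 10.4), 40.0)]


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def is_leu_decoy(chi1, chi2):
    """True if (chi1, chi2) falls inside either excised circle."""
    for (c1, c2), radius in LEU_DECOYS:
        dist = math.hypot(angle_diff(chi1, c1), angle_diff(chi2, c2))
        if dist < radius:  # assumption: the boundary itself is kept
            return True
    return False


samples = [(214.3, 215.6), (180.0, 65.0), (300.0, 175.0)]
kept = [s for s in samples if not is_leu_decoy(*s)]
```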
After inspecting the distributions for phenylalanine and tyrosine, we
conclude that there is no observable difference. Therefore, in order to
improve the quality of the data, rotamer data for Phe and Tyr are
merged.
Things left undone
There are some issues that have not yet been addressed in this
analysis. For cis prolines, the allowed rotamers and
Ramachandran
values are a subset of those for the trans distribution, but
here they are not treated separately. This data analysis also makes no
allowance for the secondary structure dependence of rotamers. This
issue was addressed only for location and frequency of central rotamer
values in The Penultimate Rotamer
Library (with separately defined helix, sheet, and plus-phi
rotamer locations for Asp and Asn, and separate occurrence frequencies
for rotamers of other residues). We plan a treatment of this problem
for the full distributions in the near future.
Nomenclature changes
There has been an important change in the way that we describe the data
since publication of the Ramachandran paper.
In that paper, we report contour levels by the percentage of data
points they include; for example, the favored/allowed boundary for
Ramachandran regions was defined to be the 98% contour. This has the
weird side effect that high numbers describe areas of low data density,
and vice versa.
We are now classifying density levels by the fraction of data points excluded
(that is, the fraction of data points lying in areas of lower density
than the area under consideration). Therefore, low numbers represent
low
density, and high numbers represent high density. The scale also now
ranges from 0.0 to 1.0, rather than from 100% to 0%, so the
favored/allowed Ramachandran boundary is now set at 0.02. The new
scheme
is used to describe data in the pct/ folder.
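For readers moving between the paper and the pct/ data, the conversion between the two conventions is plain arithmetic. The function name below is hypothetical.

```python
def percent_included_to_fraction_excluded(percent_included):
    """Convert a contour level from the paper (percent of points included)
    to the newer convention (fraction of points excluded)."""
    return 1.0 - percent_included / 100.0


# The 98% contour from the paper becomes (approximately, in floating
# point) the 0.02 level in the new scheme.
favored_boundary = percent_included_to_fraction_excluded(98.0)
```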
Another important change relates to the way the smoothing functions are
specified. In the paper, we give the maximum radius of the cosine
function, the distance at which it falls to zero. The software now uses
the half-width (i.e., radius) at half-height. This is consistent
with earlier smoothing software we produced, and more importantly,
allows one specification to refer to roughly equivalent Gaussian and
cosine distribution functions. (Because a Gaussian only approaches
zero,
no maximum radius can be defined for it.)
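To make the two specifications concrete, here is a sketch of both kernels written in terms of the half-width at half-height. The function names are hypothetical and the kernels are left unnormalized (peak height 1) for readability; this is not the Silk source code. For one period of a cosine, the maximum radius is exactly twice the half-width.

```python
import math


def cosine_kernel(x, half_width):
    """One period of a cosine, equal to 0.5 at x = half_width."""
    max_radius = 2.0 * half_width  # distance at which the kernel hits zero
    if abs(x) >= max_radius:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * x / max_radius))


def gaussian_kernel(x, half_width):
    """Gaussian equal to 0.5 at x = half_width; it never reaches zero."""
    return math.exp(-math.log(2.0) * (x / half_width) ** 2)


# With a half-width of 10 degrees, both kernels are at half height at 10,
# but only the cosine is exactly zero at 20.
values_at_20 = (cosine_kernel(20.0, 10.0), gaussian_kernel(20.0, 10.0))
```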
Gaussians vs. cosines
In analyzing these discrete data points, we build up a function that
represents the density of data points in each small local region
[specified as a vector of angles -- (phi, psi) or (chi1, chi2, ...)] --
a probability distribution. We do that by representing each data point
as a Gaussian-like function and summing those up to get the overall
distribution. The result is a density trace, something like a histogram
but without some of the histogram's limitations. (See our Structural Validation paper for a full
description.)
In fact, we don't actually use a Gaussian (something like exp[-x^2])
-- we use one period of a cosine function, from -pi to +pi. The curves
are a similar shape, and they produce almost the same result. Indeed,
you can regenerate the data using Gaussian smoothing instead of cosines
by changing the -cosine=# switches in the Makefile
to -gaussian=#
switches instead.
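The idea of a density trace can be illustrated with a one-dimensional, fixed-width sketch. The real data is two- or higher-dimensional and uses density-dependent widths, so this is only a toy under stated assumptions: the function names are hypothetical, angles wrap at 360 degrees, and the kernel is left unnormalized.

```python
import math


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def cosine_kernel(dist, half_width):
    """One period of a cosine; zero beyond twice the half-width."""
    max_radius = 2.0 * half_width
    if dist >= max_radius:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * dist / max_radius))


def density_trace(samples, grid, half_width):
    """Sum one kernel per sample at every grid position."""
    return [sum(cosine_kernel(angle_diff(g, s), half_width) for s in samples)
            for g in grid]


# Hypothetical angles: a small cluster near -60 and a lone point at 175.
samples = [-65.0, -60.0, -58.0, 175.0]
trace = density_trace(samples, [-180.0, -60.0, 0.0], half_width=10.0)
```

Unlike a histogram, the result does not depend on where bin edges happen to fall: the trace is high at -60, nonzero at -180 (which is 5 degrees from 175 after wrapping), and exactly zero at 0.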
However, the choice of smoothing function has far less impact on the
outcome than does the use of our density-dependent smoothing algorithm.
The problem with the traditional, one-pass, Gaussian-smoothing analysis
is that it blurs out the boundaries of the Ramachandran plot. Some
regions, like the shallow "beach" in the lower-left of the general
plot, have very sparse populations and soft boundaries. Other regions,
like alpha helix, are "cliffs" that have very high populations (many
orders of magnitude above the other regions) with very hard boundaries
(the population falls to zero just a few degrees to the right). The
traditional analysis is incapable of treating both in a way that gives
physically realistic results -- either the beach is left too lumpy
or the cliff is smeared out.
It is for this reason that we developed the density-dependent smoothing
algorithm, which smooths the dense regions less and the sparse regions
more. In this application the cosine has some advantages over the
Gaussian, because it falls to zero at a finite distance. Thus, it can
be computed without truncation, so its volume really sums to 1. Also,
the Gaussian must be evaluated further out [we use 4.5 halfwidths as
the limit (where the value falls to ~ 1e-6 of its maximum), rather than
just 2 for the cosine] in order to get a good approximation, which
means it can take substantially longer to compute, particularly for
higher-dimensional data spaces. Finally, because its tails actually go
to zero, the cosine is less prone to smearing out the cliffs than the
Gaussian is. The suggestion for constructing a density trace using
cosines was taken from the NCSS statistical analysis software package;
see http://exploringdata.cqu.edu.au/den_trac.htm.
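The general shape of a two-pass, variable-width estimate can be sketched as follows. This is NOT the published algorithm: the rule that maps first-pass density to kernel width is a placeholder invented for this example (a linear ramp between two hypothetical widths), the data is one-dimensional and non-periodic for brevity, and all names are assumptions. See the Structural Validation paper for the actual method.

```python
import math


def cosine_kernel(dist, half_width):
    """Unit-area 1-D cosine kernel; exactly zero beyond 2 * half_width."""
    max_radius = 2.0 * half_width
    if dist >= max_radius:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * dist / max_radius)) / max_radius


def adaptive_half_widths(samples, base_half_width, min_half_width):
    """Pass 1: fixed-width density at each sample, mapped to a width."""
    first = [sum(cosine_kernel(abs(s - t), base_half_width) for t in samples)
             for s in samples]
    peak = max(first)
    # Placeholder rule (assumption): the densest sample gets the minimum
    # width, and sparser samples get widths closer to the base width.
    return [base_half_width - (base_half_width - min_half_width) * d / peak
            for d in first]


def density_trace(samples, widths, grid):
    """Pass 2: sum kernels, each with its own sample-specific width."""
    return [sum(cosine_kernel(abs(g - s), w) for s, w in zip(samples, widths))
            for g in grid]


# Hypothetical data: a dense cluster near 2 and one isolated point at 50.
samples = [0.0, 1.0, 2.0, 3.0, 4.0, 50.0]
widths = adaptive_half_widths(samples, base_half_width=10.0, min_half_width=2.0)
grid = [-30.0 + 0.1 * i for i in range(1101)]
trace = density_trace(samples, widths, grid)
total_area = sum(trace) * 0.1
```

Because the cosine is exactly zero beyond its maximum radius, each kernel can be given unit area without truncation, so the area under the trace stays close to the number of samples however the widths vary.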
As far as we know, the density-dependent smoothing algorithm is
completely novel; we could find no existing statistical technique
that treats this type of problem. Our approach attempts to better
represent what we believe is the true underlying structure of the
(noisy) data. Thus, this analysis is almost like image processing, in
which one filters and manipulates a noisy photograph in an attempt to
extract a clearer image of the original subject. The resulting image is
quite different from the original photo, but (hopefully) is a better
representation of reality than the original data was.
Data format
All the data in the stat/, pct/, and kin/
folders is stored in plain text (ASCII) formats with Unix-style (\n)
linebreaks.
The stat/ folder
This folder holds the raw density traces for the Ramachandran and
rotamer data, where each sample is taken directly from the probability
distribution that we calculate in two passes as a sum of cosines. The
values have been normalized such that each point contributes a single
unit of area/volume/hyper-volume to the density trace (e.g., for
the Ramachandran plot, each point contributes 1.0 cubic degrees to the
volume enclosed by the distribution).
Each .data file begins with a series of comments describing
the size and spacing of the grid of samples. Samples are written one
per line, with the full coordinates (e.g., phi and psi; or chi1,
chi2, ...) followed by the density value. This format means the data
can be fed into the kinNDcont programs without further
manipulation.
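A reader for this layout might look like the sketch below. The comment marker ("#") and whitespace-separated columns are assumptions made for the example, as is the function name; check the header of an actual .data file before relying on them.

```python
def parse_density_lines(lines, comment_prefix="#"):
    """Split lines into header comments and (coordinates, density) samples.

    Assumes comment lines start with comment_prefix and that each data
    line holds whitespace-separated coordinates followed by the density.
    """
    header, samples = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(comment_prefix):
            header.append(line)
            continue
        *coords, density = (float(tok) for tok in line.split())
        samples.append((tuple(coords), density))
    return header, samples


# Hypothetical file contents, for illustration only.
example = ["# hypothetical header", "-179.0 -179.0 0.5", "", "-177.0 -179.0 0.25"]
header, samples = parse_density_lines(example)
```

Because the last column is always the density, the same function handles two-dimensional Ramachandran data and rotamer data with more chi angles.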
This data would be appropriate for statistical applications, such as
predicting the energy difference between two conformational states.
Normalizing the data in a way that is appropriate to the application at
hand is left to the user.
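As one example of such an application, a Boltzmann-style energy difference between two conformations can be estimated from the ratio of their densities. The sketch below assumes a simple inverse-Boltzmann relation and a temperature of 298.15 K; both are choices made for illustration, not something prescribed by this distribution, and the function name is hypothetical.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)


def energy_difference(density_a, density_b, temperature=298.15):
    """Estimate E_a - E_b in kcal/mol from two density values.

    Assumes the densities follow a Boltzmann distribution, so that
    density_a / density_b = exp(-(E_a - E_b) / RT).
    """
    return -GAS_CONSTANT * temperature * math.log(density_a / density_b)


# A conformation ten times more populated than another comes out roughly
# 1.4 kcal/mol lower in energy under these assumptions.
delta = energy_difference(10.0, 1.0)
```

Since only the ratio of densities enters, any overall normalization constant cancels out.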
Note that the Ramachandran plots are heavily biased by inter-residue
interactions -- secondary structure. For this reason, alpha helix and
beta sheet conformations are greatly over-represented relative to their
individual energies. You may find it more helpful to work with the data
labeled "nosec," which has all residues in repetitive secondary
structure removed.
The pct/ folder
This folder holds density traces that have been converted to fraction-excluded. Using the data
from stat/, we calculate for each sample what fraction of
data
points occur at lower density than that sample, and then output that
fraction. This data is appropriate for lookup tables that determine
whether, e.g., a given residue is in the favored (>0.02),
allowed, or outlier (<0.002; <0.0005 for general case) region of
the Ramachandran plot. Again, the files are self-describing.
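The conversion and the lookup can be sketched as follows. The cutoffs are the ones quoted above; everything else is an assumption made for the example: the function and category names, the handling of values that fall exactly on a cutoff, and the idea that a sorted list of the density at each data point is available.

```python
from bisect import bisect_left

FAVORED_CUTOFF = 0.02
DEFAULT_OUTLIER_CUTOFF = 0.002
OUTLIER_CUTOFFS = {"general": 0.0005}  # the general case uses a lower cutoff


def fraction_excluded(sample_density, sorted_point_densities):
    """Fraction of data points lying at lower density than the sample."""
    below = bisect_left(sorted_point_densities, sample_density)
    return below / len(sorted_point_densities)


def classify(score, category="general"):
    """Map a fraction-excluded value to favored / allowed / outlier."""
    if score > FAVORED_CUTOFF:
        return "favored"
    if score < OUTLIER_CUTOFFS.get(category, DEFAULT_OUTLIER_CUTOFF):
        return "outlier"
    return "allowed"


# A value of 0.001 is still allowed in the general case but would be an
# outlier for a category that uses the 0.002 cutoff.
labels = (classify(0.001, "general"), classify(0.001, "glycine"))
```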
The kin/ folder
This folder holds kinemage illustrations of the data from pct/.
View them with Mage or KiNG (downloadable from http://kinemage.biochem.duke.edu).
Last updated 23 Apr 2003 by Ian W. Davis