This gives the original (2011) version of the Top8000: a database of about 8000 high-resolution, quality-filtered protein chains. These structures were used to update the torsional distributions used in MolProbity (Chen 2010 Acta Cryst D66:12 RLab Publications), but they are applicable more generally to a wide range of structural bioinformatics studies. Later "SF" versions (~2015) were separately selected with a requirement for deposited structure factors and were used for our 2015 rotamer update. Lists of pdbID + chain for the SF versions (but not the MolProbity-processed PDB files) are available on GitHub (http://github.com/rlabduke/reference_data).

For all this work we required each chain to have:

  • resolution < 2.0Å,
  • MolProbity score < 2.0,
  • ≤ 5% of residues with bond length outliers (> 4σ),
  • ≤ 5% of residues with bond angle outliers (> 4σ), and
  • ≤ 5% of residues with Cβ deviation outliers (> 0.25Å).
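
As a rough illustration, the five criteria above can be written as a single predicate over per-chain summary statistics. This is only a sketch: the `ChainStats` record and its field names are hypothetical and are not taken from the scripts actually used to build the Top8000.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainStats:
    """Hypothetical per-chain summary; all field names are illustrative assumptions."""
    pdb_id: str
    chain_id: str
    resolution: float           # in Angstroms
    molprobity_score: float
    frac_bond_outliers: float   # fraction of residues with bond length outliers (> 4 sigma)
    frac_angle_outliers: float  # fraction of residues with bond angle outliers (> 4 sigma)
    frac_cbeta_outliers: float  # fraction of residues with C-beta deviation > 0.25 Angstroms


def passes_filters(chain: ChainStats) -> bool:
    """Return True if the chain meets all five criteria listed above."""
    return (
        chain.resolution < 2.0
        and chain.molprobity_score < 2.0
        and chain.frac_bond_outliers <= 0.05
        and chain.frac_angle_outliers <= 0.05
        and chain.frac_cbeta_outliers <= 0.05
    )


# Made-up example values, for illustration only.
good = ChainStats("1abc", "A", 1.6, 1.2, 0.00, 0.01, 0.00)
poor = ChainStats("2xyz", "B", 2.3, 1.2, 0.00, 0.01, 0.00)
print(passes_filters(good), passes_filters(poor))
```

Note that the resolution and MolProbity score cutoffs are strict (<), while the three outlier cutoffs are inclusive (≤), matching the list above.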

We then selected the best chain (lowest average of resolution and MolProbity score) from each PDB homology cluster, using the clustering run on the PDB release of March 25, 2011. A small number of ties occurred within clusters (affecting < 1% of the final chain tallies); these were resolved, arbitrarily but reproducibly, by alphabetical order of PDB ID + single-character chain ID.
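
The per-cluster selection and tie-break can be sketched as follows. This is a minimal illustration, not the original selection code: the `Candidate` record, its field names, and the cluster labels are hypothetical. It also assumes that a lower average is better, that ties are exact equality of the averaged values, and that "alphabetical order" means plain string comparison of the concatenated PDB ID and chain ID.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Candidate(NamedTuple):
    """Hypothetical record for a chain that already passed the quality filters."""
    cluster_id: str
    pdb_id: str
    chain_id: str
    resolution: float
    molprobity_score: float


def selection_key(c: Candidate) -> Tuple[float, str]:
    # Primary key: average of resolution and MolProbity score (assumed lower is better).
    # Secondary key: PDB ID + chain ID, compared as a string, to break ties reproducibly.
    return ((c.resolution + c.molprobity_score) / 2.0, c.pdb_id + c.chain_id)


def best_per_cluster(candidates: Iterable[Candidate]) -> Dict[str, Candidate]:
    """Keep the candidate with the smallest selection key in each cluster."""
    best: Dict[str, Candidate] = {}
    for c in candidates:
        current = best.get(c.cluster_id)
        if current is None or selection_key(c) < selection_key(current):
            best[c.cluster_id] = c
    return best


# Made-up example values, for illustration only.
chains = [
    Candidate("c1", "1abd", "A", 1.5, 1.5),
    Candidate("c1", "1abc", "A", 1.5, 1.5),  # ties with 1abd; wins alphabetically
    Candidate("c2", "2xyz", "B", 1.0, 1.8),
    Candidate("c2", "2xyz", "A", 1.8, 1.6),
]
picked = best_per_cluster(chains)
print({k: v.pdb_id + v.chain_id for k, v in picked.items()})
```

Using a tuple as the sort key keeps the tie-break independent of input order, which is what makes the choice reproducible.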

These are "added-value" individual chains in PDB format, with hydrogens added and "flip" misfittings corrected by MolProbity (run on March 29, 2011). The hydrogens here are at the nuclear positions rather than the later electron-cloud positions (Deis 2013 Comp. Cryst. Newsletter 4:9-10). The "flips" are done by renaming the atom positions, which is most accurate for determining the diagnostic H-bond and clash differences (Word 1999b RLab Publications), but distorts the bond lengths and angles of the flipped groups. The files in this tarball are well suited for use as bulk survey data, but the details will not exactly match results from later versions of MolProbity.

Here is a download link for the 7957 (≈ 8000) PDB files for the "standard" Top8000 database, which uses the 70% PDB homology level:

Top8000 chains (tarball ... warning: 754 MB!)

(The 50%, 90%, and 95% PDB homology level versions are also available upon request.)

If you're looking for the old Top500 database, which is almost a decade older than the Top8000 and smaller by over an order of magnitude, try this link.