This gives the original (2011) version of the Top8000: a database of about 8000 high-resolution, quality-filtered protein chains. These structures were used to update the torsional distributions used in MolProbity (Chen 2010 Acta Cryst D66:12), but they are applicable more generally to a wide range of structural bioinformatics studies. Later "SF" versions (~2015) were separately selected with a requirement for deposited structure factors and were used for our 2015 rotamer update. Lists of PDB ID + chain for the SF versions (but not the MolProbity-processed PDB files) are available on GitHub (http://github.com/rlabduke/reference_data).
For all this work we required each chain to have:
We then selected the best chain per PDB homology cluster, ranking chains by the average of resolution and MolProbity score (lower is better for both). The homology clustering was the one run on the PDB release of March 25, 2011.
There were a small number of ties within clusters (for < 1% of the final chain tallies); these were resolved, arbitrarily but reproducibly, by alphabetical order of PDB ID + single-character chain ID.
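The selection and tie-breaking rule described above can be sketched as follows. This is only an illustration, not the script actually used to build the Top8000: the function name `select_best_chains` and the dictionary keys (`pdb_id`, `chain_id`, `cluster`, `resolution`, `molprobity_score`) are hypothetical, and the sketch assumes each chain has already passed the per-chain quality filters.

```python
from collections import defaultdict


def select_best_chains(chains):
    """Pick one chain per homology cluster.

    `chains` is an iterable of dicts with the (hypothetical) keys
    pdb_id, chain_id, cluster, resolution, molprobity_score.
    Returns a dict mapping cluster -> chosen chain dict.
    """
    by_cluster = defaultdict(list)
    for chain in chains:
        by_cluster[chain["cluster"]].append(chain)

    def sort_key(chain):
        # Lower is better for both resolution and MolProbity score,
        # so the smallest average wins.
        average = (chain["resolution"] + chain["molprobity_score"]) / 2.0
        # Ties are broken by alphabetical order of PDB ID + chain ID.
        return (average, chain["pdb_id"] + chain["chain_id"])

    return {
        cluster: min(members, key=sort_key)
        for cluster, members in by_cluster.items()
    }


if __name__ == "__main__":
    example = [
        {"pdb_id": "1abc", "chain_id": "B", "cluster": 1,
         "resolution": 1.0, "molprobity_score": 2.0},
        {"pdb_id": "1abc", "chain_id": "A", "cluster": 1,
         "resolution": 1.5, "molprobity_score": 1.5},
    ]
    chosen = select_best_chains(example)[1]
    print(chosen["pdb_id"] + chosen["chain_id"])
```

Putting the tie-break string into the sort key keeps the result independent of input order, which is what makes an arbitrary choice reproducible.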
These are "added-value" individual chains in PDB format, with hydrogens added and "flip" misfittings corrected by MolProbity (run on March 29, 2011). The hydrogens here are at the nuclear positions rather than the later electron-cloud positions (Deis 2013 Comp. Cryst. Newsletter 4:9-10). The "flips" are done by renaming the atom positions, which is most accurate for determining the diagnostic H-bond and clash differences (Word 1999b J Mol Biol 285, 1735-47), but distorts the bond lengths and angles of the flipped groups. The files in this tarball are very suitable as bulk survey data, but the details will not exactly match results from later versions of MolProbity.
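To illustrate what a flip "by renaming" means, here is a minimal sketch for Asn and Gln side-chain amides. It is not MolProbity's own code: the function name `flip_by_renaming` and the `SWAPS` table are hypothetical, histidine flips are omitted, and hydrogens are not handled (they would need to be rebuilt after the relabeling). It assumes standard fixed-column PDB ATOM records.

```python
# Atom-name swaps for a side-chain amide flip (illustrative only; His omitted).
SWAPS = {
    "ASN": {" OD1": " ND2", " ND2": " OD1"},
    "GLN": {" OE1": " NE2", " NE2": " OE1"},
}


def flip_by_renaming(pdb_lines, chain_id, res_seq):
    """Relabel the amide O and N of one Asn/Gln residue.

    Coordinates are left untouched; only the atom name (columns 13-16)
    and, if present, the element symbol (columns 77-78) are changed.
    """
    out = []
    for line in pdb_lines:
        is_atom = line.startswith(("ATOM", "HETATM")) and len(line) >= 26
        if is_atom and line[21] == chain_id and int(line[22:26]) == res_seq:
            new_name = SWAPS.get(line[17:20], {}).get(line[12:16])
            if new_name is not None:
                line = line[:12] + new_name + line[16:]
                if len(line) >= 78:
                    # Element symbol is right-justified in columns 77-78.
                    line = line[:76] + new_name[1].rjust(2) + line[78:]
        out.append(line)
    return out
```

Because the coordinates do not move, the relabeled C-N bond keeps the length of the original C=O bond (roughly 1.23 Å rather than about 1.33 Å), and vice versa, which is the geometric distortion mentioned above.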
The 50%, 70%, 90%, and 95% PDB homology level versions can be provided upon request; however, we strongly encourage prospective users to use the updated top2018 version of this dataset.