CCAPD: Consolidated Curated Alignments for Phylogenetics Database

The CCAPD is a multiple sequence alignment database of 1215 protein sequences (including variants) in 28 homologous groups, retrieved from 217 species. Of these, 233 protein sequences have associated structures, and each of the 28 homologous groups has at least two sequences with associated structures. The database concentrates on sequences and species of phylogenetic utility, including for the use of phylogenetics in structural biology. CCAPD consolidates manually-curated, structurally-informed alignments from 3D_ali, HOMSTRAD, and Pfam, as well as locally-performed structural alignments with manual review. Uncertain regions in the alignment (due to unreliable structural information, disagreements between databases, or other reasons) have been identified by manual review.

Interpretation of files

The primary alignment file format is the HTML display version produced using the showalign program (with the locally-created ESIMILARITY matrix) from EMBOSS. (This was then processed further using locally-created Perl programs.)

Showalign format interpretation

Interpretation of the primary showalign format files

Reliable regions in the structural alignment are marked in blue. Regions that were not structurally alignable due to gaps in the structures (intrinsically disordered regions) are marked in red. Areas of uncertain alignment (as determined by structural alignment distances, disagreements between databases, and manual review) are in black.
Groups of lines separated by the "=" symbol are "clusters": the sequences in each group are at least 65% identical to a sequence with known structure. Areas not in blue are only aligned within clusters.
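
The 65% cluster rule above can be sketched in Python as follows. This is an illustration only, not the code used to build CCAPD: the function names are hypothetical, and the choice to count identity only over columns where both sequences have a residue is an assumption (the page does not say which denominator was used).

```python
GAP_CHARS = "-."  # assumed gap symbols in an aligned sequence


def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # assumption: gapped columns are left out of the count
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def in_cluster(seq: str, structure_seq: str, threshold: float = 65.0) -> bool:
    """True if seq meets the identity threshold against a sequence with known structure."""
    return percent_identity(seq, structure_seq) >= threshold


if __name__ == "__main__":
    print(percent_identity("ACDE-G", "ACDFHG"))
    print(in_cluster("ACDE-G", "ACDFHG"))
```

A different denominator (for example, the full alignment length) would give lower identities for gapped pairs, so cluster membership near the threshold depends on that choice.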
Sequence origin color coding:
- Archaeal sequences are marked with a purple A
- Sequences from fungi or metazoa (animals) are marked with a red *
- Plant sequences are marked with a green P
- Bacterial sequences are marked with a blue B
Further information on sequence names ("entries") can be found in the group files (below).

Alternative showalign formats
Two alternative versions of the showalign-produced alignment files are also available. (These are recommended for users of non-graphical browsers.)

The first version uses the HTML tag "STRONG" for structurally-aligned regions and the "EM" tag for intrinsically disordered regions.
The second version shows only the structurally-aligned regions differently (unless the browser is capable of displaying text as red).

Sequence origins are coded in these files using the same characters as above, but without colors.
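
For readers who want to process the first alternative version programmatically, the following Python sketch splits a fragment of markup into labeled segments using the "STRONG" and "EM" tags described above. It is only a sketch: the class and function names are hypothetical, the exact layout of the real files is not documented here, and treating untagged text as "uncertain" is an assumption based on the color scheme of the primary format. Real files also contain sequence names and origin markers, which this sketch does not separate out.

```python
from html.parser import HTMLParser

# Assumed mapping from the innermost open tag to a region label.
LABELS = {"strong": "aligned", "em": "disordered", None: "uncertain"}


class RegionParser(HTMLParser):
    """Collects (label, text) segments from STRONG/EM-tagged alignment text."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open strong/em tags
        self.segments = []    # list of (label, text) tuples

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag in ("strong", "em"):
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if data:
            current = self.open_tags[-1] if self.open_tags else None
            self.segments.append((LABELS[current], data))


def classify_regions(markup: str):
    """Return the labeled segments found in a fragment of markup."""
    parser = RegionParser()
    parser.feed(markup)
    parser.close()
    return parser.segments


if __name__ == "__main__":
    for label, text in classify_regions("<STRONG>MKV</STRONG>lt<EM>GG</EM>"):
        print(label, text)
```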

Plotcon plots of residue conservation

Plots of the degree of residue conservation versus position along the alignment are also available (in PostScript and PDF formats). The plots were produced using the EMBOSS plotcon program with an all-positive version of the ESIMILARITY matrix (and a window size of 20 residues). Gaps, including those at the ends, are treated the same as a nonconservative amino acid substitution.
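
The general idea of a windowed conservation profile, with gaps scored like a nonconservative substitution, can be sketched in Python as below. This is not plotcon's actual algorithm and does not use the ESIMILARITY matrix: the scoring function is a toy stand-in (1 for identical residues, 0 otherwise), and all function names are hypothetical.

```python
from itertools import combinations

GAP_CHARS = "-."  # assumed gap symbols


def toy_score(a: str, b: str) -> float:
    """Toy stand-in for a similarity matrix: any pair involving a gap scores
    like a nonconservative substitution (0.0); identical residues score 1.0."""
    if a in GAP_CHARS or b in GAP_CHARS:
        return 0.0
    return 1.0 if a == b else 0.0


def column_scores(alignment):
    """Mean pairwise score for each column of an alignment (list of equal-length strings)."""
    if len(alignment) < 2:
        raise ValueError("need at least two aligned sequences")
    scores = []
    for column in zip(*alignment):
        pairs = list(combinations(column, 2))
        scores.append(sum(toy_score(a, b) for a, b in pairs) / len(pairs))
    return scores


def windowed_conservation(alignment, window=20):
    """Average the column scores over each full window along the alignment."""
    scores = column_scores(alignment)
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    example = ["ACD-", "ACE-", "ACDG"]
    print(windowed_conservation(example, window=2))
```

Because gapped pairs score zero rather than being skipped, columns near ragged alignment ends pull the profile down, which matches the treatment of end gaps described above.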

Groups (of homologous proteins)

Further information

A listing of species with sequences in CCAPD is available. It includes information on species name variants and species merged (due to gene flow evidence and/or a high likelihood of confusion among sequence depositors) for some purposes. The choice of species names used is purely for the sake of maximizing recognizability, and may not correspond to current phylogenetic/taxonomic thinking.

Please see http://www.drallensmith.org/research/dissertation.final.pdf for:

More information on the alignment methodology
Further details on the proteins selected, including citations for information given in the group files (above)
An example usage of a prior version of the database for phylogenetic work

Under http://www.drallensmith.org/research/ are also examples of NEXUS-format data files produced using an earlier version of the database (also available at that location) and some of the programs used to produce CCAPD. Other programs can be found here; all are available under an open-source (GNU Affero GPL Version 3) license. Further data files can be found here.

Some further alignment formats are available under the group files, and more formats are in progress, although unfortunately most existing alignment formats have difficulty showing areas of uncertainty. The MSF-format sequence files, as used by the plotcon and showalign programs from EMBOSS and the local program consensus.weights.pl, contain weights for the sequences. These weights are currently largely arbitrary, but in the future MrBayes will be used to derive distances from which better weights can be found. In any event, the weights do not affect the alignments; they matter only for the consensus sequences and for which letters are capitalized in the display.
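
As an illustration of where the weights live in an MSF file, the Python sketch below reads the per-sequence "Weight:" values from the header (the "Name:" lines that precede the "//" separator in the usual GCG-style MSF layout). The function name and the sample data are hypothetical, and the sketch assumes that the CCAPD files follow this conventional layout.

```python
import re

# Matches header lines such as " Name: seqA  Len: 8  Check: 1111  Weight: 1.00"
NAME_LINE = re.compile(r"^\s*Name:\s+(\S+).*?Weight:\s+([0-9.]+)", re.IGNORECASE)


def read_msf_weights(text: str) -> dict:
    """Return {sequence name: weight} from the header of MSF-format text."""
    weights = {}
    for line in text.splitlines():
        if line.strip() == "//":
            break  # end of header; the aligned sequence blocks follow
        match = NAME_LINE.match(line)
        if match:
            weights[match.group(1)] = float(match.group(2))
    return weights


# Made-up sample data, for illustration only.
SAMPLE_MSF = """\
 example.msf  MSF: 8  Type: P  Check: 0  ..

 Name: seqA  Len: 8  Check: 1111  Weight: 1.00
 Name: seqB  Len: 8  Check: 2222  Weight: 0.50

//

seqA  MKV-LTGG
seqB  MKVALT-G
"""

if __name__ == "__main__":
    print(read_msf_weights(SAMPLE_MSF))
```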

CCAPD was created by Dr. Allen Smith with Dr. Peter Kahn (for structural alignment reviews) and Dr. Theodore Chase, Jr. (for alcohol dehydrogenases). Dr. Karl Kjer inspired both our use of structural alignments for phylogenetics and the creation of CCAPD for public usage.

This is viewable in Any Browser and is Valid HTML 4.01.

This webpage, and all other files in or under this directory (except for those explicitly copyrighted or licensed otherwise), are licensed (copyright 2001-2008) by Allen Smith under a Creative Commons Attribution-ShareAlike 2.5 License; also available is a text version of the Legal Code of the license. (Of course, factual material is not copyrightable - fortunately!)