CCAPD: Consolidated Curated Alignments for Phylogenetics Database

The CCAPD is a multiple sequence alignment database of 1215 protein sequences (including variants) in 28 homologous groups, retrieved from 217 species. Of these sequences, 233 have associated structures, and each of the 28 homologous groups has at least two sequences with associated structures. The database concentrates on sequences and species of phylogenetic utility, including for the use of phylogenetics in structural biology. CCAPD consolidates manually curated, structurally informed alignments from 3D_ali, HOMSTRAD, and Pfam, as well as locally performed structural alignments with manual review. Uncertain regions in the alignment (due to unreliable structural information, disagreements between databases, or other reasons) have been determined with manual review.

Interpretation of files

The primary alignment file format is the HTML display version produced using the showalign program from EMBOSS (with the locally created ESIMILARITY matrix) and then processed further using locally created Perl programs.

Showalign format interpretation

Interpretation of the primary showalign format files

Alternative showalign formats
Two alternative versions of the showalign-produced alignment files are also available. (These are recommended for users of non-graphical browsers.)

Sequence origins are coded in these files using the same characters as above, but without colors.

Plotcon plots of residue conservation

Plots of the degree of residue conservation versus position along the alignment are also available (in PostScript and PDF formats). The plots were produced using the EMBOSS plotcon program with an all-positive version of the ESIMILARITY matrix and a window size of 20 residues. Gaps, including those at the ends, are treated the same as a nonconservative amino acid substitution.
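As a rough illustration of what a windowed conservation plot measures, the following Python sketch averages a per-column similarity score over a sliding window. It is not the plotcon algorithm and does not use the ESIMILARITY matrix: the identity-based `pair_score` function, the function names, and the toy alignment are all assumptions made for this example. The one feature taken from the description above is that a gap scores the same as a nonconservative substitution (here, zero).

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b):
    """Toy similarity (an assumption, not ESIMILARITY): 1 for identical
    residues, 0 otherwise. A gap scores 0, like a nonconservative change."""
    if a == GAP or b == GAP:
        return 0.0
    return 1.0 if a == b else 0.0


def column_scores(alignment):
    """Mean pairwise similarity for each column of an aligned sequence list."""
    if len(alignment) < 2:
        raise ValueError("need at least two aligned sequences")
    ncols = len(alignment[0])
    if any(len(seq) != ncols for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    pairs = list(combinations(range(len(alignment)), 2))
    scores = []
    for col in range(ncols):
        total = sum(pair_score(alignment[i][col], alignment[j][col])
                    for i, j in pairs)
        scores.append(total / len(pairs))
    return scores


def windowed_conservation(alignment, window=20):
    """Average the column scores over each full window of `window` columns."""
    scores = column_scores(alignment)
    window = min(window, len(scores))
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


if __name__ == "__main__":
    toy = ["ACDE", "ACD-", "AC-E"]  # hypothetical alignment
    print(windowed_conservation(toy, window=2))
```

With a real alignment, the returned list would be plotted against the window start position; a substitution matrix could replace `pair_score` without changing the windowing logic.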

Groups (of homologous proteins)

Further information

A listing of species with sequences in CCAPD is available. It includes information on species name variants and on species that are merged for some purposes (because of gene flow evidence and/or a high likelihood of confusion among sequence depositors). The species names used were chosen purely to maximize recognizability, and may not correspond to current phylogenetic/taxonomic thinking.

Please see for:

Under are also examples of NEXUS-format data files produced using an earlier version of the database (also available at that location) and some of the programs used to produce CCAPD. Other programs can be found here; all are available under an open-source (GNU Affero GPL Version 3) license. Further data files can be found here.

Some further alignment formats are available under the group files, and more formats are in progress, although unfortunately most existing alignment formats have difficulty showing areas of uncertainty. The MSF-format sequence files, as used by the plotcon and showalign programs from EMBOSS and the local program, contain weights for the sequences. These weights are currently largely arbitrary, but in the future MrBayes will be used to derive distances from which better weights can be found. In any event, the weights do not affect the alignments; they matter only for the consensus sequences and for which letters are capitalized in the display.
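To illustrate why sequence weights can change a consensus sequence without changing the alignment itself, here is a minimal Python sketch of a weighted consensus. It is not the EMBOSS implementation: the function name, the `upper_fraction` threshold used to decide capitalization, and the toy sequences and weights are all assumptions made for this example.

```python
from collections import defaultdict


def weighted_consensus(sequences, weights, upper_fraction=0.5):
    """Pick the residue with the largest summed weight in each column.

    The residue is written in upper case when its share of the total weight
    reaches `upper_fraction` (a hypothetical threshold), else in lower case.
    """
    if len(sequences) != len(weights):
        raise ValueError("one weight is required per sequence")
    ncols = len(sequences[0])
    if any(len(seq) != ncols for seq in sequences):
        raise ValueError("aligned sequences must have equal length")
    total = float(sum(weights))
    consensus = []
    for col in range(ncols):
        tally = defaultdict(float)
        for seq, weight in zip(sequences, weights):
            tally[seq[col].upper()] += weight
        # On ties, the residue seen first in the column wins.
        residue, best = max(tally.items(), key=lambda item: item[1])
        if residue == "-":
            consensus.append("-")
        elif best / total >= upper_fraction:
            consensus.append(residue)
        else:
            consensus.append(residue.lower())
    return "".join(consensus)


if __name__ == "__main__":
    seqs = ["ACD", "ACE", "GCE"]  # hypothetical aligned sequences
    print(weighted_consensus(seqs, [1, 1, 1]))
    print(weighted_consensus(seqs, [1, 1, 3]))
    print(weighted_consensus(seqs, [1, 1, 3], upper_fraction=0.7))
```

The aligned sequences are identical in all three calls; only the weights and the threshold differ, which is enough to change both the consensus residue in the first column and its capitalization.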

CCAPD was created by Dr. Allen Smith with Dr. Peter Kahn (for structural alignment reviews) and Dr. Theodore Chase, Jr. (for alcohol dehydrogenases). Dr. Karl Kjer inspired both our use of structural alignments for phylogenetics and the creation of CCAPD for public usage.

This is viewable in Any Browser and is Valid HTML 4.01.

This webpage, and all other files in or under this directory (except for those explicitly copyrighted or licensed otherwise), are licensed (copyright 2001-2008) by Allen Smith under a Creative Commons Attribution-ShareAlike 2.5 License; also available is a text version of the Legal Code of the license. (Of course, factual material is not copyrightable - fortunately!)