Evaluating Clustering Properties of Protein Sequence Datasets

Below are links to two compressed folders containing the results of our experiments on the Pfam and SCOP datasets. Each folder holds one results file per dataset in our study: pFamDataset1.txt, ..., pFamDataset10.txt, scopDataset1.txt, ..., scopDataset8.txt, scopDatasetA.txt, scopDatasetB.txt.

Pfam Datasets

SCOP Datasets

File Format

The beginning of pFamDataset1.txt looks like this:

cluster size: 1704
152.0 0.0
161.0 0.0
148.0 0.0
148.0 0.0

cluster size: 5655
38.0 0.0
0.0 0.0
0.0 0.0
0.0 0.0

We sample 4 proteins from each ground truth cluster (10 proteins for the SCOP datasets). For each sampled protein p in ground truth cluster C, we use BLAST to evaluate the similarity of p to proteins inside and outside of C. More specifically, the similarity of two proteins is the bit-score of their alignment, or 0 if BLAST does not detect any similarity between them. Each block in the file begins with a "cluster size" line giving the size of the ground truth cluster, and each row below it gives the results for one sampled protein. The first value in a row is the median within-cluster similarity (the 25th percentile for the SCOP datasets), and the second value is the maximum between-cluster similarity.
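As an illustration of how one row could be derived, the following Python sketch summarizes the similarity scores for a single protein. It is not the code used in our experiments. The function name summarize_protein and its parameters are hypothetical, the scores are assumed to be already extracted from BLAST output (with 0 for undetected similarity), and the percentile interpolation rule is an assumption, since the text above does not specify one.

```python
import statistics


def summarize_protein(within_scores, between_scores, use_quartile=False):
    """Return (within-cluster summary, maximum between-cluster similarity).

    within_scores:  bit-scores of the protein against the other members
                    of its own ground truth cluster (0 if none detected).
    between_scores: bit-scores against proteins in all other clusters.
    use_quartile:   False -> median (as for the Pfam datasets),
                    True  -> 25th percentile (as for the SCOP datasets).
    """
    if use_quartile:
        # quantiles(n=4) returns the three quartile cut points; the first
        # one is the 25th percentile. The "inclusive" method is an assumption.
        within = statistics.quantiles(within_scores, n=4, method="inclusive")[0]
    else:
        within = statistics.median(within_scores)
    # A protein with no scored neighbors outside its cluster gets 0.
    between = max(between_scores, default=0.0)
    return float(within), float(between)
```

Calling summarize_protein(scores_inside, scores_outside) yields the pair written on one row of a Pfam file; passing use_quartile=True gives the SCOP variant.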

In this example, for the first protein sampled from the first ground truth cluster (of size 1704), the median within-cluster similarity is 152, meaning that half of the other proteins in the cluster have similarity of at least 152 to this protein. Its maximum between-cluster similarity is 0, meaning that BLAST detected no similarity to proteins in any other cluster.
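The file layout described above can be read with a few lines of Python. This is a minimal sketch that assumes every file follows the layout of the excerpt: a "cluster size: N" line followed by rows of two whitespace-separated numbers, with blank lines between clusters. The function name parse_results and the dictionary keys are hypothetical.

```python
def parse_results(text):
    """Parse a results file into a list of clusters.

    Each cluster is a dict with the ground truth cluster size and a list of
    (within_cluster_similarity, max_between_cluster_similarity) rows.
    """
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate clusters
        if line.startswith("cluster size:"):
            size = int(line.split(":", 1)[1])
            clusters.append({"size": size, "rows": []})
        else:
            if not clusters:
                raise ValueError("data row found before any 'cluster size' line")
            within, between = map(float, line.split())
            clusters[-1]["rows"].append((within, between))
    return clusters


# The excerpt from pFamDataset1.txt shown above.
sample = """cluster size: 1704
152.0 0.0
161.0 0.0
148.0 0.0
148.0 0.0

cluster size: 5655
38.0 0.0
0.0 0.0
0.0 0.0
0.0 0.0
"""

clusters = parse_results(sample)
print(clusters[0]["size"], clusters[0]["rows"][0])
```

For a file on disk, the same function can be applied to the text returned by open(path).read().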

We expect proteins to be more similar to members of their own cluster than to members of other clusters. If the first value is usually larger than the second, the dataset is more likely to have the structure implied by the (c, epsilon) property, which would allow us to cluster it accurately.
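One simple way to quantify "usually larger" is the fraction of rows in which the first value exceeds the second. The sketch below computes that fraction for the rows of the excerpt above. It is only an informal diagnostic, not a test of the (c, epsilon) property itself; the function name is hypothetical, and treating ties (such as 0.0 0.0) as unfavorable is an assumption.

```python
def fraction_within_exceeds_between(rows):
    """Fraction of rows whose within-cluster value is strictly larger
    than the maximum between-cluster value."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to evaluate")
    favorable = sum(1 for within, between in rows if within > between)
    return favorable / len(rows)


# Rows from the two clusters in the pFamDataset1.txt excerpt.
rows = [
    (152.0, 0.0), (161.0, 0.0), (148.0, 0.0), (148.0, 0.0),
    (38.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0),
]

fraction = fraction_within_exceeds_between(rows)
print(fraction)  # 5 of the 8 rows are favorable: 0.625
```

A value close to 1 over a whole file suggests that sampled proteins are typically closer to their own cluster than to any other cluster.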