The aim is to produce clusters of identical sequences. This is
problematic because "identical" here does not mean what one
might expect. Consider three sequences:
1. cccgt
2. cccgttt
3. ccgta
Sequence 1 is identical to both 2 and 3, but sequences 2 and 3
are obviously not identical to each other. In other words,
identity as used here is not transitive.
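The following sketch shows one way sequence 1 can match both of
the others. It assumes that the three sequences are aligned and
padded with a gap symbol, and that two sequences count as
identical when they agree in every column where both have a base.
The padding, the "-" symbol and the function name "compatible"
are assumptions made for illustration; the text above does not
state them.

```python
GAP = "-"  # assumed gap / missing-data symbol


def compatible(a, b):
    """True if a and b agree in every column where both have a base."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return all(x == y or GAP in (x, y) for x, y in zip(a, b))


# Hypothetical alignment of the three example sequences.
s1 = "cccgt--"
s2 = "cccgttt"
s3 = "-ccgta-"

print(compatible(s1, s2))  # True
print(compatible(s1, s3))  # True
print(compatible(s2, s3))  # False: column 6 has t versus a
```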
Therefore an "algorithmic definition" of identical is used. First,
the longest sequence in the alignment is picked as the anchor
(sequence 2 above). The remaining sequences are then compared to
the anchor in order of decreasing length, and those that match
are added to it. When all sequences have been compared to the
anchor, there are two possible outcomes. If no sequences were
added, the anchor sequence is output. If sequences were added,
the new anchor is compared to all sequences again, except those
that have already been added to an anchor. Once the anchor has
been output, the next longest remaining sequence serves as the
initial anchor.
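The procedure above can be sketched roughly as follows. This is
not the program's actual code. It assumes aligned sequences padded
with "-", takes "length" to mean the number of non-gap characters,
treats two sequences as identical when they agree wherever both
have a base, and assumes that "adding" a sequence fills the
anchor's gaps with that sequence's bases. All function names are
hypothetical.

```python
GAP = "-"  # assumed gap / missing-data symbol


def n_bases(seq):
    """Sequence length, counted here as the number of non-gap characters."""
    return sum(ch != GAP for ch in seq)


def compatible(a, b):
    """True if a and b agree in every column where both have a base."""
    return all(x == y or GAP in (x, y) for x, y in zip(a, b))


def merge(anchor, seq):
    """Fill gaps in the anchor with bases from seq (assumed meaning of 'add')."""
    return "".join(y if x == GAP else x for x, y in zip(anchor, seq))


def cluster(seqs):
    """seqs maps name -> aligned sequence; returns [(anchor, member names)]."""
    # Longest first; sequences of equal length are ordered by name.
    order = sorted(seqs, key=lambda name: (-n_bases(seqs[name]), name))
    used = set()
    clusters = []
    for name in order:
        if name in used:
            continue  # already added to an earlier anchor
        used.add(name)
        anchor, members = seqs[name], [name]
        added = True
        while added:  # repeat the pass until nothing more is added
            added = False
            for other in order:
                if other not in used and compatible(anchor, seqs[other]):
                    anchor = merge(anchor, seqs[other])
                    members.append(other)
                    used.add(other)
                    added = True
        clusters.append((anchor, members))  # the anchor is "output"
    return clusters


example = {"1": "cccgt--", "2": "cccgttt", "3": "-ccgta-"}
for anchor, members in cluster(example):
    print(anchor, members)
# cccgttt ['2', '1']
# -ccgta- ['3']
```

Under these particular assumptions a sequence that was rejected
once stays rejected, because merging only fills gaps in the
anchor. The repeated pass is kept only to mirror the description.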
To get the same result even if the order of the sequences in
the alignment file changes, sequences of the same length are
ordered by name (id). This sort is alphanumeric.
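A small sketch of such an ordering is given below. It assumes that
"alphanumeric" means plain string comparison of the names (so that
"seq10" would sort before "seq2") and that length is the number of
non-gap characters. Both are assumptions, and the records and the
name "order_key" are made up for illustration.

```python
def order_key(record):
    """Sort key: longest sequence first, ties broken by name."""
    name, seq = record
    return (-len(seq.replace("-", "")), name)


# The same hypothetical records in two different file orders.
file_order_a = [("b", "acgt-"), ("a", "acg-t"), ("c", "acgtt")]
file_order_b = list(reversed(file_order_a))

sorted_a = sorted(file_order_a, key=order_key)
sorted_b = sorted(file_order_b, key=order_key)

print([name for name, _ in sorted_a])  # ['c', 'a', 'b']
print(sorted_a == sorted_b)            # True
```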
An alternative approach would be to minimize the number of
clusters of identical sequences produced. This is probably
mathematically difficult and beyond my skills.