Help
This program finds identical sequences in a FASTA input file.
Mouse over green text to read additional information. This web application was created by Björn Canbäck.
Algorithm
The aim is to produce clusters of identical sequences. This is problematic because the term "identical" does not mean here what one might expect. Consider the following three sequences:
 
1. cccgt
2. cccgttt
3. ccgta
Sequence 1 is identical to both sequence 2 and sequence 3 wherever they overlap, but sequences 2 and 3 are obviously not identical to each other. For this reason an "algorithmic definition" of identical is used.
First the longest sequence in the alignment is picked as the anchor (sequence 2 above). The remaining sequences are then compared to the anchor in order of decreasing length, and each sequence found to be identical is added to the anchor sequence. When all sequences have been compared to the anchor, there are two possible outcomes. If no sequences were added, the anchor sequence is output. If sequences were added, the new anchor is compared to all sequences again, except those that have already been added to an anchor. When the anchor has been output, the next longest remaining sequence serves as the initial anchor.
To get the same results even if the sequence order in the alignment file is changed, sequences of the same length are ordered alphanumerically by name (id). An alternative approach would be to minimize the number of clusters produced. This is probably mathematically difficult and beyond my skills.
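The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by this web application: two sequences are treated as identical if they can be placed against each other, without gaps, so that every overlapping position matches and the overlap is at least min_overlap nucleotides long; the first such placement found is used; and the anchor is extended as soon as a sequence is added. The names find_merge, cluster, min_overlap and min_length are hypothetical, with min_overlap and min_length standing in for the two options described below.

```python
def find_merge(anchor, seq, min_overlap):
    """Return anchor extended by seq if the two agree over an ungapped
    overlap of at least min_overlap (assumed >= 1) positions, else None."""
    # offset = position of seq[0] relative to anchor[0]; negative means
    # that seq starts to the left of the anchor.
    for offset in range(-(len(seq) - min_overlap), len(anchor) - min_overlap + 1):
        start = max(0, offset)
        end = min(len(anchor), offset + len(seq))
        if end - start < min_overlap:
            continue
        if anchor[start:end] == seq[start - offset:end - offset]:
            # Keep the parts of seq that stick out on either side.
            left = seq[:-offset] if offset < 0 else ""
            right = seq[len(anchor) - offset:] if offset + len(seq) > len(anchor) else ""
            return left + anchor + right
    return None


def cluster(records, min_overlap=4, min_length=0):
    """Cluster a dict {id: sequence}; return a list of (anchor, [ids])."""
    # Longest first; ties broken by id (plain string order is an assumption).
    pending = sorted(
        ((name, seq) for name, seq in records.items() if len(seq) >= min_length),
        key=lambda rec: (-len(rec[1]), rec[0]),
    )
    clusters = []
    while pending:
        name, anchor = pending.pop(0)
        members = [name]
        added = True
        while added:  # repeat until a full pass adds nothing
            added = False
            remaining = []
            for other_name, seq in pending:
                merged = find_merge(anchor, seq, min_overlap)
                if merged is None:
                    remaining.append((other_name, seq))
                else:
                    anchor = merged
                    members.append(other_name)
                    added = True
            pending = remaining
        clusters.append((anchor, members))
    return clusters


example = {"1": "cccgt", "2": "cccgttt", "3": "ccgta"}
print(cluster(example, min_overlap=4))
```

With the three example sequences and a minimum overlap of 4, this sketch puts sequences 2 and 1 in one cluster and sequence 3 in a cluster of its own, although sequence 1 on its own would also match sequence 3. The result depends on which sequence is picked as the anchor, which is why the ordering rule matters.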
Options
Overlap (number of nucleotides):
Discard sequences shorter than (number of nucleotides):
Upload FASTA sequence file: