After base calling, vector sequences and tag sequences are removed. The sequences are then grouped with the TGICL program (based on MegaBLAST and CAP3) and an in-house program that considers the 3' / 5' mate-pair information from the same clone. The sequences in each group are then assembled with CAP3, whose output consists of contigs and singlets. Here we put the 3' assembly and the 5' assembly into the same cluster even when gaps remain between those contigs, because we regard such a cluster as corresponding to a single gene.
During base calling with the Phred basecaller, a FASTA-format sequence file and a quality information file, called a QUAL file, are generated for each 3'-end or 5'-end read of each clone in the cDNA library. Each clone therefore has two FASTA entries, one from the 3' end and one from the 5' end, and two corresponding QUAL entries, except for about 15% of clones, which have only a 5' EST or only a 3' EST.
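The pairing of entries by clone can be sketched as follows. This is a minimal illustration and not the pipeline's own code: it assumes entry names of the form "<clone>_<F|R>.ab1", where only the "_R" form actually appears in this text and the "_F" counterpart is a guess, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming pattern "<clone>_<F|R>.ab1"; the "_F" suffix is a guess.
ENTRY_NAME = re.compile(r"^(?P<clone>.+)_(?P<end>[FR])\.ab1$")


def group_by_clone(entry_names):
    """Map each clone ID to the set of read-end labels seen for it."""
    ends = defaultdict(set)
    for name in entry_names:
        match = ENTRY_NAME.match(name)
        if match is None:
            continue  # skip names that do not follow the assumed pattern
        ends[match.group("clone")].add(match.group("end"))
    return dict(ends)


def single_end_fraction(ends_by_clone):
    """Fraction of clones for which only one end was sequenced."""
    if not ends_by_clone:
        return 0.0
    single = sum(1 for ends in ends_by_clone.values() if len(ends) == 1)
    return single / len(ends_by_clone)
```

Applied to the full list of entry names, single_end_fraction would be expected to come out near the 15% figure quoted above, provided the naming assumption holds.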
Masked FASTA entry example. Bold characters indicate the masked portion.
>NIASHv1059F07_R.ab1 CGTACATG 925 14 523 ABI 99 424
To prevent misassembly caused by similarity between repetitive sequences, these sequences should be screened out before assembly. For this purpose we use RepeatMasker, which converts uppercase characters to lowercase in the repetitive regions of the target sequences. As a result of this masking, a FASTA entry may contain patches of lowercase characters, which are ignored in the subsequent grouping process. We used the TIGR wheat repeat library and the TIGR barley repeat library as repeat references.
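To make the effect of lowercase masking concrete, the following sketch locates the lowercase patches in a single sequence and reports how much of it is masked. It only inspects a sequence string that has already been masked; it does not run RepeatMasker, and the function names are hypothetical.

```python
import re


def lowercase_spans(sequence):
    """Return half-open (start, end) spans of lowercase (masked) bases."""
    return [m.span() for m in re.finditer(r"[a-z]+", sequence)]


def masked_fraction(sequence):
    """Fraction of the sequence that lies inside lowercase patches."""
    if not sequence:
        return 0.0
    masked = sum(end - start for start, end in lowercase_spans(sequence))
    return masked / len(sequence)
```

For example, the sequence "ACGTacgtACgg" contains two masked patches, at positions 4 to 8 and 10 to 12 (0-based, end-exclusive), so half of its bases are masked.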
Grouping before assembly
To reduce the memory required by the assembly process, we adopted the EST clustering program TGICL (TIGR), followed by an in-house reclustering program that uses physical connection information.
In our customized grouping process, we first used TGICL to group sequences based on sequence similarity alone. An in-house program then reconstructs these groups using clone pair information. Specifically, TGICL was run with the "-X" option to generate a cluster file containing the grouping information.
TGICL execution option
tgicl [FASTA file] -q [QUAL file] -p 99 -X
The cluster file and the clone pair information file are input into the in-house program, which generates a reconstructed cluster file.
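The in-house program is not described in detail here, so the following is only one plausible way to perform such a reconstruction: similarity-based clusters that contain the two ends of the same clone are merged with a union-find structure. The input shapes (a list of clusters and a list of entry-name pairs) and the function name are assumptions made for illustration.

```python
def recluster(clusters, clone_pairs):
    """Merge similarity clusters that are linked by a clone pair.

    clusters:    iterable of iterables of entry names (assumed input shape)
    clone_pairs: iterable of (entry_a, entry_b) pairs from the same clone
    Returns a sorted list of sorted entry-name lists.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Entries in the same similarity cluster start out in one set.
    for members in clusters:
        members = list(members)
        for entry in members:
            find(entry)
        for entry in members[1:]:
            union(members[0], entry)

    # Entries from the same clone pull their clusters together.
    for entry_a, entry_b in clone_pairs:
        union(entry_a, entry_b)

    merged = {}
    for entry in list(parent):
        merged.setdefault(find(entry), set()).add(entry)
    return sorted(sorted(group) for group in merged.values())
```

Union-find is used because each clone pair may chain several clusters together, and the structure handles such transitive merges without repeated scans of the cluster file.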
We assemble the 'X'-masked sequences with CAP3. CAP3 generates representative contig sequences and an output report that lists the contigs and their component entries.
Clustering after assembly
To estimate the total gene cluster dataset of barley, we placed contigs linked by mate-pair information into the same cluster.
FLcDNAs are clustered by BLASTN search with thresholds of at least 95% identity and an E-value below 10^-5.
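The identity and E-value thresholds can be applied to BLASTN results as in the sketch below. It assumes the hits are available in the standard 12-column tabular BLAST output (query, subject, percent identity, ..., E-value, bit score); the text does not say which output format was actually used, and the function name is hypothetical.

```python
def filter_hits(lines, min_identity=95.0, max_evalue=1e-5):
    """Keep hits with identity >= min_identity and E-value < max_evalue.

    Expects tab-separated lines in the 12-column tabular BLAST layout
    (an assumption): column 3 is percent identity, column 11 is E-value.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        evalue = float(fields[10])
        if query == subject:
            continue  # a sequence always matches itself
        if identity >= min_identity and evalue < max_evalue:
            kept.append((query, subject, identity, evalue))
    return kept
```

The retained query and subject pairs can then serve as links between FLcDNAs when forming clusters.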