Potato ESTs: Documentation

Contents

1.0 EST Sequencing
2.0 EST Processing
3.0 Removal of Late Blight Sequences
4.0 EST Clustering and Assembly
5.0 Functional Annotation
6.0 Gene Ontology
7.0 Terminology
8.0 References





1.0 EST Sequencing

Potato ESTs were sequenced using the MegaBACE 100 or 4000 sequencing machine and ET terminator chemistry.



2.0 EST Processing
  • 2.1 Base Calling
  • Bases were called using Phred (version 0.020425.c; Ewing and Green, 1998; Ewing et al., 1998).

  • 2.2 Vector Removal and Low Quality Base Trimming

    • Vector and adapter sequences were masked using the cross_match program.

    • Customized Perl scripts were used to mask low quality bases. Starting at both the 3' and 5' ends of a sequence, a sliding window iterates inward as long as the average quality value within the window is below the threshold. Two window sizes were used, a smaller one of 5 bases and a larger one of 30 bases, to detect both small and large regions of low quality data. A threshold of 10 was used.

    • Masked regions were removed from the sequences.
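
The end-trimming step above can be sketched in Python as follows. This is an illustrative sketch, not the project's Perl scripts. The function names `low_quality_prefix` and `trim_ends` are hypothetical, and it assumes that the window keeps moving inward while its mean quality is at or below the threshold and stops at the first window whose mean exceeds it. It also trims directly instead of masking first and removing later.

```python
def low_quality_prefix(quals, window, threshold):
    """Return the number of leading bases to trim.

    Assumed rule: slide a fixed-size window inward from the start and stop
    at the first window whose mean quality exceeds the threshold.
    """
    n = len(quals)
    window = min(window, n)  # short reads are judged as a single window
    if window == 0:
        return 0
    for start in range(n - window + 1):
        mean_quality = sum(quals[start:start + window]) / window
        if mean_quality > threshold:
            return start
    return n  # no window passed: the whole read is low quality


def trim_ends(seq, quals, windows=(5, 30), threshold=10):
    """Trim both ends of a read, once per window size (5 and 30 bases)."""
    for window in windows:
        left = low_quality_prefix(quals, window, threshold)
        seq, quals = seq[left:], quals[left:]
        # Reversing the qualities lets the same scan handle the other end.
        right = low_quality_prefix(quals[::-1], window, threshold)
        seq, quals = seq[:len(seq) - right], quals[:len(quals) - right]
    return seq, quals
```

Because the decision is based on a window mean, a few low quality bases next to a high quality region can survive trimming; a stricter per-base rule would be a different design.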

  • 2.3 Low Quality Sequence Removal
  • Sequences with greater than 3% Ns were removed from further analysis.
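
A minimal sketch of the 3% N filter, assuming sequences are plain strings. The function names are hypothetical, and treating an empty sequence as passing is an arbitrary choice made here for illustration.

```python
def n_fraction(seq):
    """Fraction of positions in seq that are N (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def passes_n_filter(seq, max_fraction=0.03):
    """Keep a sequence unless more than 3% of its bases are N."""
    return n_fraction(seq) <= max_fraction
```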

  • 2.4 Remove Contaminant Sequences
  • Sequences were compared to each of the following four reference libraries:

    • E. coli complete genome
    • Arabidopsis mitochondrial sequences
    • Arabidopsis chloroplast sequences
    • Solanum rRNA and snoRNA sequences

    Any significant matches were removed from further analysis.

  • 2.5 Minimum and Maximum Length Filter
  • Sequences shorter than 100 bp or longer than 4000 bp were removed from further analysis.
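
The length filter amounts to a range check. The sketch below assumes the ESTs are held in a dictionary keyed by identifier; the function name and the example identifiers are made up for illustration.

```python
def filter_by_length(sequences, min_len=100, max_len=4000):
    """Keep sequences whose length is between min_len and max_len bp, inclusive."""
    return {
        seq_id: seq
        for seq_id, seq in sequences.items()
        if min_len <= len(seq) <= max_len
    }


# Illustrative input: identifiers and sequences are invented.
example = {"est_short": "A" * 99, "est_ok": "A" * 100, "est_long": "A" * 4001}
kept = filter_by_length(example)
```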

  • 2.6 Temporarily Mask Solanum Repeats
  • EST sequences were compared to a reference library of plant repeats containing:

    • Satellite DNA repeats
    • Transposable elements
    • Repetitive sequences
    • Retrotransposons
    • SINEs, LINEs

    Any significant matches were temporarily masked during the assembly process.

  • 2.7 Temporarily Mask Low Complexity Regions
  • Any low complexity regions, including poly A/T, were temporarily masked during the assembly process.
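
One way to picture temporary masking is soft-masking, where masked bases are lowercased so the original sequence can be restored afterwards. The sketch below handles only poly A/T runs. The minimum run length of 10 and the function name are assumptions, and the source does not say how masking was actually implemented.

```python
import re


def soft_mask_poly_at(seq, min_run=10):
    """Lowercase runs of at least min_run consecutive A or T bases.

    min_run is an assumed value; the documented pipeline does not state one.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_run, min_run))
    return pattern.sub(lambda match: match.group(0).lower(), seq.upper())


masked = soft_mask_poly_at("GCGC" + "A" * 12 + "GCGC")
restored = masked.upper()  # unmasking is just uppercasing again
```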



3.0 Removal of Late Blight Sequences

Cleaned EST sequences were screened (E-value cut-off of < 1.0 e-30), using BLASTN, against a comprehensive late blight nucleotide database, as well as a plant nucleotide database made from a subset of the GenBank nucleotide database. ESTs with a lower E-value hit to a late blight sequence than to a plant nucleotide sequence were removed from further analysis.
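
The decision rule can be sketched as a comparison of the best E-value from each search. The function name is hypothetical. The sketch also assumes two things the text does not state: a hit that does not pass the cut-off is treated as no hit, and an EST with a qualifying late blight hit but no qualifying plant hit is removed.

```python
def is_late_blight(blight_evalue, plant_evalue, cutoff=1e-30):
    """Decide whether an EST should be removed as a late blight sequence.

    Each argument is the best (lowest) E-value against that database,
    or None when the search returned no hit.
    """
    if blight_evalue is None or blight_evalue >= cutoff:
        return False  # no qualifying late blight hit
    if plant_evalue is None or plant_evalue >= cutoff:
        return True   # assumed: only the late blight database matched
    return blight_evalue < plant_evalue
```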



4.0 EST Clustering and Assembly

ESTs were clustered and assembled using the Paracel Transcript Assembly program (version 2.7; Paracel, Pasadena, CA). Following clustering, ESTs were assembled into high quality contigs if they had 94% similarity, a minimum overlap of 40 bp, and gaps shorter than 9 bases. Chimeric sequences were identified during the assembly process and removed to prevent misalignments.
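
The three assembly criteria can be read as a test on a pairwise overlap. The sketch below is not how the Paracel Transcript Assembly program works internally. It only restates the thresholds, with a hypothetical function name, and assumes that 94% similarity means at least 94% matching positions within the overlap.

```python
def meets_assembly_criteria(overlap_len, matches, gap_lengths,
                            min_similarity=0.94, min_overlap=40, max_gap=9):
    """Check an overlap against the documented thresholds.

    overlap_len: aligned overlap length in bp
    matches:     number of identical positions in the overlap
    gap_lengths: lengths of the individual gaps in the alignment
    """
    if overlap_len < min_overlap:
        return False
    if matches / overlap_len < min_similarity:
        return False
    # Every gap must be shorter than 9 bases.
    return all(gap < max_gap for gap in gap_lengths)
```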



5.0 Functional Annotation

Homology searches of CPGP contigs and singletons were conducted with the Basic Local Alignment Search Tool (BLAST) procedures (Altschul et al., 1990). The unigene set was searched using BLASTX against a local copy of the GenBank protein database (http://www.ncbi.nlm.nih.gov/). A putative function was assigned to a sequence using an E-value cut-off of < 1.0 e-10.
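
As an illustration of applying the E-value cut-off, the sketch below picks the best hit per query from BLAST results. It assumes the results are available in the 12-column tab-delimited BLAST format (query ID in column 1, subject ID in column 2, E-value in column 11); the source does not say which output format was used, and the function name and identifiers are made up.

```python
def best_hits(tabular_lines, cutoff=1e-10):
    """Map each query to its lowest E-value hit below the cut-off."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue >= cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Illustrative records: all identifiers and values are invented.
example_lines = [
    "contig1\tprotA\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-40\t150",
    "contig1\tprotB\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-60\t200",
    "single7\tprotC\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.001\t30",
]
annotations = best_hits(example_lines)
```

Sequences with no hit below the cut-off, such as `single7` here, are simply absent from the result and would remain without a putative function.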



6.0 Gene Ontology

A Gene Ontology (http://www.geneontology.org/) term was assigned to the sequences in the unigene set by screening the sequences, using BLASTX, against the Arabidopsis thaliana protein database (http://www.Arabidopsis.org). Sequences were assigned the Gene Ontology term of the annotated Arabidopsis thaliana BLAST hit (E-value cut-off of < 1.0 e-10), and then grouped according to classification.
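
The transfer and grouping of terms can be sketched with two lookups. Both input mappings are assumptions about how the data might be held: one from each sequence to its best Arabidopsis hit (already filtered by the E-value cut-off), and one from each Arabidopsis protein to its Gene Ontology terms. The function name and all identifiers are placeholders, not real accessions.

```python
from collections import defaultdict


def group_by_go_term(best_hit_by_sequence, go_terms_by_protein):
    """Give each sequence the GO terms of its best hit, then group by term."""
    groups = defaultdict(set)
    for seq_id, protein_id in best_hit_by_sequence.items():
        for term in go_terms_by_protein.get(protein_id, ()):
            groups[term].add(seq_id)
    return {term: sorted(seq_ids) for term, seq_ids in groups.items()}


# Placeholder data for illustration only.
hits = {"contig1": "prot_x", "contig2": "prot_y", "single3": "prot_x"}
go_terms = {"prot_x": ["GO:term_a"], "prot_y": ["GO:term_a", "GO:term_b"]}
grouped = group_by_go_term(hits, go_terms)
```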



7.0 Terminology
  • EST: Expressed Sequence Tag. Sequences derived from the partial sequencing of a cDNA clone.
  • Cluster: A group of sequences with a high degree of sequence similarity. Contigs/Singletons within a cluster will often represent alternative splice forms of a gene, or members of the same gene family.
  • Contig: Sequences derived from the assembly of contiguous, overlapping ESTs.
  • Singleton: ESTs that do not assemble with any other EST, and are therefore unique instances of a gene.


8.0 References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) "Basic local alignment search tool". J. Mol. Biol. 215:403-410.

Ewing B, Green P (1998) "Base-calling of Automated Sequencer Traces Using Phred. II. Error Probabilities". Genome Res. 8:186-194.

Ewing B, Hillier L, Wendl MC, Green P (1998) "Base-calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment". Genome Res. 8:175-185.

Ronning CM, Stegalkina SS, Ascenzi RA, Bougri O, Hart AL, Utterbach TR, Vanaken SE, Riedmuller SB, White JA, Cho J, Pertea GM, Lee Y, Karamycheva S, Sultana R, Tsai J, Quackenbush J, Griffiths HM, Restrepo S, Smart CD, Fry WE, Van Der Hoeven R, Tanksley S, Zhang P, Jin H, Yamamoto ML, Baker BJ, Buell CR. (2003) "Comparative Analyses of Potato Expressed Sequence Tag Libraries". Plant Physiol. 131(2):419-429.