Advanced PipMaker Instructions

Advanced PipMaker processes the contents of the same four files as does the basic form of PipMaker. In addition, a number of choices, options, and new capabilities are provided.

Option: color

Regions of interest in the first sequence, such as the locations of exons or regulatory elements, can be marked by colored vertical stripes. To use this feature of Advanced PipMaker, upload a first sequence underlay file, containing lines like:

1000 2000 LightBlue

Colors must be selected from a restricted list. It is possible to paint just the upper half of a vertical stripe by requesting, e.g.:

50060 50227 Red +

(To see the results of that command, look at the BTK pip.) Use "-" to paint just the lower half. This permits annotations to differentiate between the two strands, or to plot potentially overlapping features like gene predictions and database matches. Beware that the appearance of the colors will vary from one printer or monitor to the next.
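
Before uploading an underlay file, it can help to check that every line has the expected shape. The following is a minimal sketch in Python, assuming only the two line forms shown above (start, end, color, and an optional "+" or "-"). The names Stripe and parse_underlay_line are hypothetical and are not part of PipMaker. The sketch does not check color names, because the restricted list is not reproduced here.

```python
from typing import NamedTuple, Optional


class Stripe(NamedTuple):
    """One underlay stripe (hypothetical helper type, not part of PipMaker)."""
    start: int
    end: int
    color: str
    half: Optional[str]  # "+" = upper half, "-" = lower half, None = full height


def parse_underlay_line(line: str) -> Stripe:
    """Parse a 'start end color [+|-]' line as described in the text above."""
    fields = line.split()
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 fields, got {len(fields)}: {line!r}")
    start, end = int(fields[0]), int(fields[1])
    if start > end:
        raise ValueError(f"start exceeds end: {line!r}")
    half = fields[3] if len(fields) == 4 else None
    if half not in (None, "+", "-"):
        raise ValueError(f"fourth field must be '+' or '-': {line!r}")
    return Stripe(start, end, fields[2], half)


if __name__ == "__main__":
    for text in ("1000 2000 LightBlue", "50060 50227 Red +"):
        print(parse_underlay_line(text))
```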


Pick one: Search one strand or Search both strands

The alignment program can optionally report only regions of similarity between the first sequence and the second sequence in its given orientation. The default is to align the first sequence with both the second sequence and the reverse complement of the second sequence.


Pick one: Show all matches, Chaining, or Single coverage

With the default setting of "Show all matches", it is possible for one region of the first sequence to align with several regions of the second sequence because of duplications of a gene or an exon, or because of incomplete masking of interspersed repeats or low-complexity regions. Such duplications cause lines to appear one over the other in the pip. PipMaker provides two options for eliminating such duplicate matches, each with its own strengths and weaknesses.

If the "Chaining" option is chosen, then PipMaker will identify and plot only matches that appear in the same relative order in the first and second sequences. This option should be used only if the genomic structures of the two sequences are known to be conserved, since otherwise a duplication might avoid detection. With Chaining, the alignment program is run with lower thresholds (i.e., higher sensitivity).
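
The idea of keeping only matches that occur in the same relative order can be illustrated with a small dynamic program. This is a sketch under stated assumptions and not PipMaker's actual chaining code: each match is treated as an indivisible block with a score, and one match may follow another only if it lies strictly later in both sequences. The name best_chain and the tuple layout are hypothetical.

```python
from typing import List, Tuple

# (start1, end1, start2, end2, score); coordinates are inclusive.
Match = Tuple[int, int, int, int, float]


def best_chain(matches: List[Match]) -> List[Match]:
    """Return a highest-scoring set of matches that are collinear in both sequences."""
    ordered = sorted(matches, key=lambda m: (m[0], m[2]))
    n = len(ordered)
    if n == 0:
        return []
    best = [m[4] for m in ordered]  # best chain score ending at each match
    prev = [-1] * n                 # predecessor index for traceback
    for j in range(n):
        for i in range(j):
            a, b = ordered[i], ordered[j]
            # b may follow a only if it starts after a ends in BOTH sequences.
            if a[1] < b[0] and a[3] < b[2] and best[i] + b[4] > best[j]:
                best[j] = best[i] + b[4]
                prev[j] = i
    j = max(range(n), key=best.__getitem__)
    chain = []
    while j != -1:
        chain.append(ordered[j])
        j = prev[j]
    return chain[::-1]


if __name__ == "__main__":
    matches = [
        (100, 200, 150, 250, 50.0),
        (300, 400, 350, 450, 40.0),
        (300, 400, 900, 1000, 30.0),  # duplicate copy, out of order
        (500, 600, 550, 650, 60.0),
    ]
    for m in best_chain(matches):
        print(m)
```

In this toy input the third match plays the role of a duplicated exon: it cannot be placed in order with the match that follows it, so it is dropped from the chain.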

For an example of chaining, consider the first pip shown below, which is taken from a larger pip based on a 31 kb sequence. As the pip indicates, exon 7 of the first sequence has a number of matches in the second sequence, presumably due to duplications of that exon or of the entire gene. The dotplot view of the entire alignment, shown at the left of the second row, also indicates the duplication (at around position 7000 on the horizontal axis), as well as duplications in later exons. The panels on the right show the results of specifying the "Chaining" option.

An alternative method for avoiding duplicate matches is provided by the "Single coverage" option, which selects a highest-scoring set of alignments such that any position in the first sequence can appear in at most one alignment (though there is no guarantee that the order of matching regions is identical in the two sequences). The following three dotplots show a case where this option works better than does chaining. The top panel shows all matches in a gene cluster where the first sequence has six copies of the gene, while the second sequence has four. With chaining, only four of the genes can be matched, as shown in the second panel. The "Single coverage" option selects one match for each region of the first sequence (panel 3).

Further discussion of this example can be found on the Examples page, under beta-globin.

Chaining is preferable to the "Single coverage" option in cases where (1) the second sequence is contiguous, (2) the comparison is with just a single strand of the second sequence, and (3) the order of conserved regions is identical in the two sequences, since under conditions 1-3 the results from the "Chaining" option will be more biologically meaningful. (Also, the comparison will be faster.) However, the three-panel example shows that if condition (3) does not hold, the "Chaining" option may give inferior results. A strength of the "Single coverage" option is that it guarantees single coverage even if the second sequence is compared in both orientations, or if it is fragmented (see below). In such cases, the "Chaining" option is applied separately each time the first sequence is compared with an orientation or fragment of the second sequence, so multiple coverage can result.
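
To make the contrast with chaining concrete, the "Single coverage" idea can be sketched as weighted interval scheduling on the first sequence alone: positions in the second sequence are ignored, so order there is not enforced. This is an illustration under simplifying assumptions, not PipMaker's implementation. In particular, alignments are treated here as indivisible, whereas a real implementation might trim overlapping alignments. The name single_coverage and the tuple layout are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple

# (start1, end1, score, label); coordinates in the first sequence, inclusive.
Alignment = Tuple[int, int, float, str]


def single_coverage(alignments: List[Alignment]) -> List[Alignment]:
    """Pick a highest-scoring subset in which no first-sequence position is covered twice."""
    ordered = sorted(alignments, key=lambda a: a[1])
    ends = [a[1] for a in ordered]
    n = len(ordered)
    best = [0.0] * (n + 1)  # best[k] = best total score using the first k alignments
    for k in range(1, n + 1):
        start, _, score, _ = ordered[k - 1]
        p = bisect_left(ends, start)  # number of alignments ending before 'start'
        best[k] = max(best[k - 1], best[p] + score)
    chosen = []
    k = n
    while k > 0:
        start, _, score, _ = ordered[k - 1]
        p = bisect_left(ends, start)
        if best[p] + score > best[k - 1]:
            chosen.append(ordered[k - 1])  # taking this alignment is strictly better
            k = p
        else:
            k -= 1
    return chosen[::-1]


if __name__ == "__main__":
    alignments = [
        (100, 200, 50.0, "A"),
        (150, 300, 80.0, "B"),
        (250, 400, 50.0, "C"),
        (500, 600, 30.0, "D"),
    ]
    print([a[3] for a in single_coverage(alignments)])
```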


Select forms of output

The user selects output files from the following list: the pip, a one-page dotplot view of the alignments, the condensed form of the alignment, the traditional textual form of the alignment, and an analysis of the exons. The user can supply an optional title for the pip. (By default, the title is taken from the first line of the exons file.) The user can request that the output be in PostScript format instead of PDF format.


Analysis of exons

For the optional analysis of exons, a program attempts to use the alignments to map each position in the "exons" file into the corresponding position in the second sequence. For each coding-region specification (i.e., line beginning with "+") the first and last codons are displayed, and for each intron the first and last two nucleotides are shown. These tri- and dinucleotides are shown for the first sequence and, if possible, for the second sequence. If the coding region is specified, then the putative coding region is printed (for both sequences, if possible). For example, if the exons file contains:

< 27591 30475 L44L
+ 27641 30438
27591 27661
27932 28054
29660 29727
29918 30023
30436 30475

then the program might generate:

< L44L: 27591-30475     35159-37479
  CDS:  27641-30438     35202-37442     TTA-CAT TTA-CAT
   5:   27591-27661     35159-35222     CT-AC   CT-AC
   4:   27932-28054     35586-35708     CT-AC   CT-AC
   3:   29660-29727     36581-36648     CT-AC   CT-AC
   2:   29918-30023     36976-37081     CT-AC   CT-AC
   1:   30436-30475     37440-37479

>L44L, putative CDS for sequence 1 (321 bp)
ATGGTTAACGTCCCTAAAACCCGCCGGACTTTCTGTAAGAAGTGTGGCAA
GCACCAACCCCATAAAGTGACACAGTACAAGAAGGGCAAGGATTCTCTGT
  ....

This output shows that the L44L coding region in both sequences begins with ATG (the reverse complement of CAT) and ends with TAA (the reverse complement of TTA). Similarly, all four introns conform to the normal GT-AG splicing consensus, since CT-AC read on the opposite strand is GT-AG.
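
The strand reasoning in this example can be checked with a few lines of Python. This is only an illustrative helper (the name reverse_complement is ours, not part of PipMaker), and it handles only the four unambiguous nucleotides.

```python
# Translation table for the four unambiguous nucleotides, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the sequence as read on the opposite strand."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Codons reported for the CDS line: last codon, then first codon.
    print(reverse_complement("TTA"), reverse_complement("CAT"))
    # Dinucleotides reported for each intron, read on the gene's own strand.
    print(reverse_complement("AC"), reverse_complement("CT"))
```

Because the gene lies on the reverse strand, the intron's own first two nucleotides are the reverse complement of the last two reported (AC), and its last two are the reverse complement of the first two reported (CT).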

When the second sequence file is split into contigs, the output is somewhat more complicated. It begins with an enumeration of the contigs (using their FastA header lines), and positions in the second sequence are identified as i:j which denotes position j in contig i. For instance, the index might include:

Index of fragments of the second sequence:
	1: >Contig101
	2: >Contig102

and the listing of exon positions might contain:

< L44L:	27591-30475	1:35159-2:1479
  CDS:	27641-30438	1:35202-2:1442	TTA-CAT	TTA-CAT
   5:	27591-27661	1:35159-1:35222	CT-AC	CT-AC
   4:	27932-28054	1:35586-1:35708	CT-AC	CT-AC
   3:	29660-29727	2:581-2:648	CT-AC	CT-AC
   2:	29918-30023	2:976-2:1081	CT-AC	CT-AC
   1:	30436-30475	2:1440-2:1479

Here, the left-most CDS position (end of the stop codon) aligns with position 35202 in contig 1, which has FastA header line ">Contig101".
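
For readers who post-process this output, the i:j notation can be resolved against the fragment index with a short script. The following is a sketch that assumes the index and range formats are exactly as in the example above. The function names parse_index, parse_position, and parse_range are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def parse_index(lines: Iterable[str]) -> Dict[int, str]:
    """Map contig numbers to header names from lines such as '1: >Contig101'."""
    index = {}
    for line in lines:
        number, _, header = line.strip().partition(":")
        index[int(number)] = header.strip().lstrip(">")
    return index


def parse_position(text: str, index: Dict[int, str]) -> Tuple[str, int]:
    """Turn 'i:j' into (contig name, position j within that contig)."""
    contig_no, _, offset = text.partition(":")
    return index[int(contig_no)], int(offset)


def parse_range(text: str, index: Dict[int, str]) -> Tuple[Tuple[str, int], Tuple[str, int]]:
    """Turn 'i:j-k:l' into a pair of (contig name, position) endpoints."""
    left, _, right = text.partition("-")
    return parse_position(left, index), parse_position(right, index)


if __name__ == "__main__":
    index = parse_index(["\t1: >Contig101", "\t2: >Contig102"])
    print(parse_range("1:35159-2:1479", index))
```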


Raw blastz output

The alignment file produced by PipMaker's blastz program can be viewed in laj, which is an interactive tool for viewing and manipulating pairwise alignment output. The laj program, which is written in Java, must be downloaded and run on the user's computer.


PDF with hypertext annotations

Notes on configuring Acroread:

Follow the menu File > Preferences > Weblink and set the Link Information selector to "Always Show".

If the PDF version of the Pip is viewed using Adobe Acrobat Reader, then a message that corresponds to the mouse position appears in the status area at the bottom of the Acrobat window. When the mouse is positioned in a region of the pip corresponding to a local alignment, the message describes the region covered by the alignment, whereas between two alignments it describes the gap. The first word on the FastA header line for the second sequence, or a contig thereof, is reported. This indicates which contig aligns with this region of the first sequence.

For example, the following messages might be associated with various regions of a certain sequence (submitted as the first sequence to PipMaker).

   region		message
   1-1400	1400 bp unmatched
1401-1586	1401-1586 matches 3146-3324 of Contig35
1587-2191	605 bp gap; 889 bp in Contig35
2192-3367	2192-3367 matches 4214-5531 of Contig35
3368-3888	521 bp gap; between Contig35 and Contig41-
3889-3934	3889-3934 matches 6-51 of Contig41-
3935-4089	155 bp gap; 1013 bp in Contig41-
4090-4375	4090-4375 matches 1065-1340 of Contig41-

The notation "Contig41-" refers to the reverse complement of the sequence with FastA header ">Contig41 ...". In this example, the first sequence has two local alignments with Contig35, followed by two local alignments to the reverse complement of Contig41. Note that the latter pair of alignments is separated by 155 bp (positions 3935-4089) in the first sequence and 1013 bp in Contig41-, perhaps due to an insertion at this point in the second sequence.

These messages in the pip are not intended to be clickable -- we are slightly abusing the URL mechanism by storing the messages as URLs, because it is the most convenient way to display a short string in the status area.
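
The gap lengths in the example messages follow directly from the coordinates of the neighboring alignments. The sketch below reproduces that arithmetic. It is an illustration only, not PipMaker's code: it assumes inclusive coordinates and assumes that positions in a reverse-complemented contig are already given in that orientation, as in the table above. The name gap_message and the tuple layout are hypothetical.

```python
from typing import Tuple

# (start1, end1, start2, end2, contig); coordinates are inclusive.
Alignment = Tuple[int, int, int, int, str]


def gap_message(before: Alignment, after: Alignment) -> str:
    """Describe the unaligned region between two consecutive local alignments."""
    gap1 = after[0] - before[1] - 1  # unaligned bases in the first sequence
    if before[4] == after[4]:
        gap2 = after[2] - before[3] - 1  # unaligned bases in the shared contig
        return f"{gap1} bp gap; {gap2} bp in {before[4]}"
    return f"{gap1} bp gap; between {before[4]} and {after[4]}"


if __name__ == "__main__":
    alignments = [
        (1401, 1586, 3146, 3324, "Contig35"),
        (2192, 3367, 4214, 5531, "Contig35"),
        (3889, 3934, 6, 51, "Contig41-"),
        (4090, 4375, 1065, 1340, "Contig41-"),
    ]
    for before, after in zip(alignments, alignments[1:]):
        print(gap_message(before, after))
```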


Second sequence can be in contigs

PipMaker permits the second "sequence file" to, in fact, contain several sequences, such as the contigs produced by draft sequencing (i.e., before a clone is completely sequenced). The sequences in the second file are to be separated by FastA header lines, each beginning with a ">" character. No additional effort on the user's part is needed to handle a fragmented second sequence. The first sequence file, however, must contain a single sequence.
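
A quick way to confirm that a second sequence file is laid out as described is to split it at the header lines and list the contigs. The following is a minimal sketch, assuming every sequence in the file is preceded by a ">" header line. The function name read_contigs is hypothetical. It keeps only the first word of each header, since that is the word PipMaker reports in its messages.

```python
from typing import Iterable, List, Tuple


def read_contigs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split FastA-formatted lines into (name, sequence) pairs, one per header line."""
    contigs = []
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                contigs.append((name, "".join(chunks)))
            fields = line[1:].split()
            name = fields[0] if fields else ""  # first word of the header
            chunks = []
        else:
            if name is None:
                raise ValueError("sequence data found before the first header line")
            chunks.append(line)  # letter case is preserved
    if name is not None:
        contigs.append((name, "".join(chunks)))
    return contigs


if __name__ == "__main__":
    example = [">Contig101 draft", "ACGT", "acgt", ">Contig102", "GGCC"]
    for contig_name, sequence in read_contigs(example):
        print(contig_name, len(sequence))
```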


User-controlled masking of interspersed repeats and low-complexity regions

PipMaker attempts to eliminate spurious alignments caused by interspersed repeat elements and low-complexity regions, yet indicate meaningful alignments that may involve such elements. For instance, an interspersed repeat or tri-nucleotide repeat sequence may be part of a protein-coding region. The strategy is not perfect, however, and the user may want to take control of the way these elements are handled during construction of the alignment.

The alignment program used by PipMaker interprets lower case letters as indicating regions of the sequence that are to be masked at early stages of the alignment process, but not at later stages. The alignment program follows the general design of the "gapped Blast" family of programs, which start by finding short, exact matches, then extend those matches to alignments that include gaps. PipMaker ignores regions of the first sequence that contain lower case letters when searching for exact matches, but utilizes those regions when expanding exact matches to form longer alignments. (Regions containing only "N" or "X" characters are not aligned in either phase.) Users of the Advanced PipMaker page can control masking by submitting sequences that follow this convention.
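
The lower-case convention can be applied to a sequence with a few lines of code before submission. The sketch below shows one way to mask chosen regions and, as an illustration of the effect described above, lists which fixed-length words would remain available as starting points for exact matches. The function names, the 1-based inclusive interval convention, and the word length k are assumptions made for this example. The alignment program's actual seeding procedure is more elaborate than this.

```python
from typing import Iterable, List, Tuple


def soft_mask(seq: str, intervals: Iterable[Tuple[int, int]]) -> str:
    """Lower-case the given 1-based, inclusive intervals of a sequence."""
    chars = list(seq)
    for start, end in intervals:
        chars[start - 1:end] = [c.lower() for c in chars[start - 1:end]]
    return "".join(chars)


def seedable_words(seq: str, k: int) -> List[int]:
    """Return 0-based starts of length-k words made only of upper-case A, C, G, T."""
    return [
        i for i in range(len(seq) - k + 1)
        if all(c in "ACGT" for c in seq[i:i + k])
    ]


if __name__ == "__main__":
    masked = soft_mask("ACGTACGTAC", [(3, 6)])
    print(masked)
    print(seedable_words(masked, 3))
```

Words that touch a lower-case letter are skipped when looking for exact matches, but the lower-case letters themselves stay in the sequence, so they remain available when a match is later extended.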