Human-Mouse genomic sequence comparisons,

by Scott Schwartz, Laura Elnitski, Webb Miller, Ross Hardison.

(Last change: Fri Jun 22 16:35:23 EDT 2001)

Homology between human and mouse sequences serves as a useful guide for identifying genes in both organisms. Additionally, conserved sequences that have no coding potential are good candidates for regulatory elements. With this in mind, we set out to test the feasibility of aligning the finished human genome with the set of currently available mouse sequences (known as reads), which can be found at the NCBI or ENSEMBL web sites. The mouse genome has been sequenced to 3X coverage, represented in 15-16 million reads of roughly 500 bp each. The position of many of these reads, as they correspond to the human sequence, is provided at The Human Genome Browser BLAT track. Because that comparison was made at the amino acid level, it may not capture every read that aligns to a given human region.

Our first goal was to align a well-annotated region to serve as a reference for our subsequent alignments. We have examined several regions so far, and plan to do more in the near future. By way of example, we'll discuss our analysis of the human Major Histocompatibility Complex (MHC), which spans over 3.6 Mb of DNA and contains over 200 genes. The sequence of the human MHC region from telomere to centromere was obtained from the Sanger Center web site, along with annotations for all of the known genes and pseudogenes in the region. Further annotations from GenBank were added, including exon boundaries.

The annotation files are provided here (they are subject to change when we update them): human sequence, mouse sequence, human repeats, human exons, underlay file.

Detailed documentation for these file formats is available from the PipMaker web pages.

This sequence was compared to a finished mouse sequence that covered MHC class II and class III genes. Since additional mouse sequence is available in the form of working draft assemblies, we chose to extend the analysis by including some mouse contigs that correspond to genes in the human class I region as well. The total mouse sequence covers 2.5 Mb.

We made a dotplot, a pip, and a single-coverage pip of the alignment between the finished (and masked) human MHC and the finished mouse MHC, computed with blastz.

We'll compare this result to similar alignments using selected mouse reads.

As an initial test, we used the BLAT alignments of the mouse reads that fell within the coordinates of the MHC region. A dotplot showing the comparison between the human sequence and the collection of BLAT hits generates a diagonal line that follows a similar pattern to that of the finished/draft alignment, with a few differences. For instance, the BLAT alignment appears to be a 1:1 correspondence of genes between the human and mouse sequences; therefore, the paralogous members of the class I and class II families do not appear in the comparison.
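The selection of hits by region can be sketched as follows. This is a minimal illustration, not the actual procedure used: it assumes each BLAT hit has already been reduced to a (read name, human start, human end) tuple, it assumes that "within the coordinates" means lying entirely inside the region, and the read names and coordinates are made up for the example.

```python
from typing import Iterable, List, Tuple

# Hypothetical hit record: (read_name, human_start, human_end),
# with 0-based, half-open coordinates on the human sequence.
Hit = Tuple[str, int, int]


def hits_in_region(hits: Iterable[Hit], region_start: int, region_end: int) -> List[Hit]:
    """Keep only the hits that lie entirely within [region_start, region_end)."""
    return [h for h in hits if h[1] >= region_start and h[2] <= region_end]


# Made-up hits; the 3.6 Mb region length is only a round illustrative figure.
example_hits = [
    ("readA", 100, 600),               # inside the region
    ("readB", 4_000_000, 4_000_500),   # outside the region
    ("readC", 3_599_900, 3_600_400),   # straddles the region end, so excluded
]
selected = hits_in_region(example_hits, 0, 3_600_000)
print([h[0] for h in selected])
```

A looser rule (keeping any hit that overlaps the region) would be an equally reasonable choice; the stricter one is used here only to keep the example simple.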

It should be noted that the human MHC sequence available at the Sanger Center is slightly different from the assembly available at the Human Genome Browser (HGB). Therefore, annotations obtained from the HGB do not match the precise location of the same features in the Sanger version. However, the overall difference between the two sequences is limited to a few places, which can be seen in a dotplot comparison of the two sequences. The version available at the HGB was used to identify homologous mouse reads by the program BLAT and could potentially show a slightly different alignment pattern.

Next we extracted the BLAT-selected mouse reads and aligned them using blastz. Several BLAT alignments might refer to a given read (once for each strand, for example), so we collected each read at most once. Moreover, we collected the reads in the order in which they appear in the fasta files from NCBI. That means the reads appear in unpredictable order relative to the human sequence. This is apparent in the raw dotplot we generated. Using the OandO program, we were able to order and orient the mouse reads according to blastz's predicted correspondence with the human sequence. We then generated an ordered dotplot for that case. These ordering issues are not relevant to the pip, which we also generated.
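The two bookkeeping steps described above (collecting each read once, then ordering and orienting the reads along the human sequence) can be sketched as follows. This is a stand-in under stated assumptions and does not reproduce the OandO program: the record layout, the rule of keeping the best-scoring alignment per read, and all names and values are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def unique_reads_in_order(read_names: Iterable[str]) -> List[str]:
    """Return each read name once, in order of first appearance."""
    seen = set()
    unique = []
    for name in read_names:
        if name not in seen:
            seen.add(name)
            unique.append(name)
    return unique


# Hypothetical alignment summary: (read_name, human_start, strand, score).
Alignment = Tuple[str, int, str, int]


def order_and_orient(alignments: Iterable[Alignment]) -> List[Tuple[str, str]]:
    """Keep the best-scoring alignment per read, then sort reads by human start.

    Returns (read_name, strand) pairs; a '-' strand means the read would be
    reverse-complemented before plotting.
    """
    best: Dict[str, Alignment] = {}
    for aln in alignments:
        name, score = aln[0], aln[3]
        if name not in best or score > best[name][3]:
            best[name] = aln
    ordered = sorted(best.values(), key=lambda a: a[1])
    return [(name, strand) for name, _, strand, _ in ordered]


# Made-up data: r2 is hit twice, once on each strand.
names = unique_reads_in_order(["r2", "r1", "r2", "r3", "r1"])
layout = order_and_orient([
    ("r2", 900, "-", 50),
    ("r1", 100, "+", 80),
    ("r2", 300, "+", 70),
    ("r3", 500, "-", 60),
])
print(names)
print(layout)
```

With the reads placed this way, a dotplot reads as a diagonal instead of a scatter, while per-position displays such as the pip are unaffected.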

We wanted to see if a nucleotide-based search would find mouse reads missed by the BLAT search using translated sequences. We used the mouse finished sequence (gi|6670848) to find aligning mouse reads. Since these should be almost exact matches, we could use a rapid, stringent search tool such as megablast, a program that was designed for finding almost exact matches in very large datasets, such as working draft sequences.

The megablast-selected reads were aligned to the finished human MHC sequence using blastz. As before, we ordered the mouse reads as predicted by blastz, producing an ordered dotplot and a pip.

A side-by-side comparison of the pips showed that BLAT did a remarkably good job of finding mouse reads. For example, the region 2140K..2240K in human shows similar coverage for all three mouse sequences. However, the megablast-selected reads included some genes missed by the BLAT selection. In the region 2040K..2140K, the megablast-selected reads align with the HSPA1L genes, whereas the BLAT-selected reads do not. The complete side-by-side comparison is available in this multipip.
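One way to put a number on "similar coverage" within a window is to compute the fraction of human bases covered by at least one aligned interval. The sketch below is only an illustration of that idea, not how the pips were compared; the function name and the interval values are hypothetical.

```python
from typing import Iterable, Tuple


def covered_fraction(intervals: Iterable[Tuple[int, int]],
                     window_start: int, window_end: int) -> float:
    """Fraction of [window_start, window_end) covered by half-open intervals."""
    # Clip every overlapping interval to the window, then sort by start.
    clipped = sorted(
        (max(s, window_start), min(e, window_end))
        for s, e in intervals
        if s < window_end and e > window_start
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Close the previous merged run and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping or adjacent: extend the current run.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (window_end - window_start)


# Made-up aligned intervals on the human sequence.
fraction = covered_fraction([(0, 50), (40, 100), (150, 300)], 0, 200)
print(fraction)
```

Computing this value per window for each set of reads (finished sequence, BLAT-selected, megablast-selected) would flag windows where one selection falls well short of the others.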

We conclude that the BLAT search of translated sequences missed some relevant mouse reads. The feasibility of using a rapid alignment program for a nucleotide-level search for mouse reads that match human needs to be investigated.