Ross C. Hardison1, Belinda Giardine2, Laura Elnitski1,2, Cathy Riemer2, Scott Schwartz2, Matt Weirauch2, and Webb Miller2,3
Departments of Biochemistry and Molecular Biology1, Computer Science and Engineering2 and Biology3, The Pennsylvania State University, University Park, PA 16802
The determination and annotation of complete genomic DNA sequences provide the opportunity for unprecedented advances in our understanding of evolution, genetics and physiology, but the amount and diversity of the data also pose daunting challenges. Currently, excellent browsers provide access to the sequence and annotations of the human and mouse genomes, but they are most commonly used to examine a single gene or region at a time. To support complex queries across multiple types of information simultaneously and at multiple loci, we developed a database of human genome sequence alignments and annotations, called GALA. Strong conservation of DNA sequences between species can be a good predictor of function, so we included data on sequence conservation based on whole-genome alignments between human and mouse. Extensive annotations are recorded, including RefSeq genes, gene predictions by several different models, functional information and expression data retrieved from NCBI's LocusLink, SNPs, and transcription factor binding sites identified by matches to weight matrices in TRANSFAC. Complex queries use a history page to combine the results of simple queries, either by common set operations or with proximity and clustering features developed for GALA. An example of a complex query supported by GALA is "Find all clusters of two C/EBP-binding sites and one NF-E2-binding site that are in regions conserved between human and mouse (at least 50 bp of at least 70% identity)." This query returns 45 results, one of which is in the LIMK2 locus. Output from GALA queries can be viewed as a track on the UCSC Human Genome Browser, as a table of data with hyperlinks, or as text. Query results with alignments may be viewed using Laj, a versatile, interactive alignment viewer implemented in Java. The database is available online at http://globin.cse.psu.edu/ and http://bio.cse.psu.edu/. (Supported by NIH grant HG02238.)
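The example query can be pictured as a filter on conserved regions followed by a clustering step. The Python sketch below is only an illustration of that idea, not GALA's implementation. The names `Region`, `Site`, `within` and `find_clusters`, the 200 bp `max_span`, and all coordinates are hypothetical; the abstract does not specify how GALA defines cluster size.

```python
from collections import Counter
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int       # 0-based, inclusive
    end: int         # exclusive
    identity: float  # percent identity of the human-mouse alignment


class Site(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str        # transcription factor label, e.g. "CEBP"


def within(site, regions):
    """True if the site lies entirely inside any region on the same chromosome."""
    return any(
        r.chrom == site.chrom and r.start <= site.start and site.end <= r.end
        for r in regions
    )


def find_clusters(sites, required, max_span):
    """Return groups of sites that fit within max_span bp and meet the required counts.

    A window is anchored at each site in positional order; it is reported if it
    holds at least required[name] sites of every name. Overlapping windows are
    not merged in this simple version.
    """
    clusters = []
    ordered = sorted(sites, key=lambda s: (s.chrom, s.start))
    for i, first in enumerate(ordered):
        window = [
            s for s in ordered[i:]
            if s.chrom == first.chrom and s.end - first.start <= max_span
        ]
        counts = Counter(s.name for s in window)
        if all(counts[name] >= n for name, n in required.items()):
            clusters.append(window)
    return clusters


# Toy data with invented coordinates (assumption, not taken from GALA).
regions = [
    Region("chr22", 1000, 1400, 82.0),
    Region("chr22", 4990, 5060, 60.0),  # fails the 70% identity threshold
]
sites = [
    Site("chr22", 1010, 1020, "CEBP"),
    Site("chr22", 1050, 1060, "NFE2"),
    Site("chr22", 1090, 1100, "CEBP"),
    Site("chr22", 5000, 5010, "CEBP"),
    Site("chr22", 5020, 5030, "CEBP"),
    Site("chr22", 5040, 5050, "NFE2"),
]

# Step 1: keep regions of at least 50 bp and at least 70% identity.
conserved = [r for r in regions if r.end - r.start >= 50 and r.identity >= 70.0]

# Step 2: restrict binding sites to conserved regions (a set-style intersection).
conserved_sites = [s for s in sites if within(s, conserved)]

# Step 3: look for clusters of two C/EBP sites and one NF-E2 site.
clusters = find_clusters(conserved_sites, {"CEBP": 2, "NFE2": 1}, max_span=200)
for cluster in clusters:
    print(cluster[0].chrom, cluster[0].start, cluster[-1].end)
```

In this toy data only the first group of sites is reported, because the second group falls in a region below the identity threshold. The intermediate results (`conserved`, `conserved_sites`) are kept separate to loosely mirror how the history page combines the results of simple queries.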
References:
GALA: Giardine et al. (2003) Genome Research, in press.
Laj: Wilson et al. (2001) Nucleic Acids Res 29: 1352-1365.