A Review of the AAAS Seminar
"Beyond the Human Genome"
February 17-18, 2001
San Francisco, California
L. Stephen Coles, M.D., Ph.D.,
Director for Education and Internet Content
The Kronos Longevity Research Institute
Dr. Francis Collins, Director of the Public Project, and
Dr. Craig Venter, CEO of The Celera Genomics Group,
were keynote speakers on consecutive evenings.
The following review of the two-day Beyond the Human Genome Seminar, held in conjunction with the AAAS Annual Meeting at the San Francisco Hilton on Saturday and Sunday, February 17-18, summarizes my personal view of the 14 most important surprises that I encountered after studying the recently published versions of the assembled human genome in both Science and Nature magazines [1,2], as well as some of the original material presented by the 21 speakers at the Seminar itself.
For those with access to the Internet, a full streaming-video version of the Seminar will be available free of charge in about six weeks on the website of the professional video recording company that taped the presentations [3]; at about the same time, a CD-ROM of the slide presentations will be available for purchase. During his Saturday evening Plenary Lecture, Dr. Collins showed a ten-minute video on the Human Genome Project (targeting high-school students) that will be available free of charge to the public and can be requested at the NIH human genome website [4].
II. Surprises in the Human Genome
Surprise 1. The human genome contains substantially fewer genes than we thought it would. Both competing groups (the privately-funded Celera Genomics Group and the publicly-funded International Human Genome Project) have identified only 26,588 (plus another 12,731 possible candidate) genes among our 3.1 billion base pairs. This controversial, low-side estimate of under 40,000 human genes is significantly below the original estimate of 100,000 genes that was made some ten years ago, which was based on a preliminary examination of one relatively dense chromosome (Chromosome 19).
To discover that our own genome is comparable in gene count to that of a simple weed (Arabidopsis thaliana has recently been found to contain 25,498 genes [5]) may have been disconcerting to some (who were looking for a way to "explain" human complexity in comparison with "lower" forms of life), but this was not the case for everyone. Some of us realized beforehand that bricks and blueprints are different. If genes can be thought of as "bricks" and complete organisms as "structures composed of bricks," then it should be clear that one could take a large pile of bricks and build either a shopping center or a cathedral -- it all depends on the blueprints one uses during construction (embryogenic programs in the case of organisms). Don't architecture students spend most of their time learning how to draw blueprints that are meaningful to builders? They don't learn how to manufacture a wide variety of self-assembling, special-purpose bricks. Those pharmaceutical companies, for example, that were hoping to find a unique gene for every single disease condition-of-interest (so they could patent them before anyone else) are doomed to be disappointed.
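The arithmetic behind the "under 40,000" figure can be checked in a few lines. The following is only a minimal sketch using the counts quoted above; the variable and function names are illustrative assumptions and do not come from either genome project.

```python
# Gene counts as quoted in this review (Surprise 1).
CONFIRMED_GENES = 26_588      # genes identified with confidence
CANDIDATE_GENES = 12_731      # additional possible candidates
ORIGINAL_ESTIMATE = 100_000   # estimate made some ten years earlier
ARABIDOPSIS_GENES = 25_498    # Arabidopsis thaliana, for comparison


def upper_bound_gene_count(confirmed: int, candidates: int) -> int:
    """Upper bound on the gene count if every candidate turns out to be real."""
    return confirmed + candidates


upper_bound = upper_bound_gene_count(CONFIRMED_GENES, CANDIDATE_GENES)
fraction_of_estimate = upper_bound / ORIGINAL_ESTIMATE
human_to_weed_ratio = CONFIRMED_GENES / ARABIDOPSIS_GENES

print(f"Upper bound on human genes: {upper_bound:,}")
print(f"Fraction of the original estimate: {fraction_of_estimate:.1%}")
print(f"Confirmed human genes per Arabidopsis gene: {human_to_weed_ratio:.2f}")
```

Even if every candidate gene is counted, the total stays below 40,000, and the confirmed count is within a few percent of the Arabidopsis count.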
Surprise 2. The human genome is about the same size as the mouse genome. In particular, there are only about 300 genes in the human genome (less than one percent) for which there are no counterparts in the murine genome. Therefore, it should be expected that all mammals from primates (humans, chimps, apes, and monkeys) to dogs, cats, horses, cows, pigs, goats, sheep, rodents, whales, dolphins, giraffes, elephants, etc. will have genomes of essentially the same size. The blueprints will be different (homeoboxes), of course (giraffes have long necks while elephants have short necks but long trunks), yet the genetic homologies will be coextensive.
Surprise 3. The human genome is "lumpy." The genome is filled with vast "geneless" deserts ("junk DNA" and repeat units) punctuated by a few oases of actual genes in a cluster (only 1.1 percent of the genome consists of genes; 98.9 percent consists of non-mRNA-coding DNA regions). It sort of looks like the map of North America seen from the space shuttle at night -- cities on the East and West Coasts (Boston, New York, Philadelphia, Washington, D.C., Atlanta, Miami and Vancouver, Seattle, San Francisco, Los Angeles, San Diego, Tijuana) with Chicago and Houston somewhere in the middle. The Midwest, for example, is a vast desert of darkness in the night sky. Chromosome 19 was cited as an example of an "urban area" with a very high density of genes.
Surprise 4. [40 - 50] percent of all the genes identified have no known function or even a category in which to place them. The protein specialists have their work cut out for them for the next five years trying to decipher this sudden block of data thrust upon them. Speaking metaphorically, it was like walking in the dark into the middle of a strange gymnasium packed with odd living-room furniture and then having someone turn on the overhead lights so we could look around (panoramically) and see things for the first time. "Oh, there's a sofa; there's a chair. But what's that thing over there that looks like a sculpture? I don't have a clue what people would need that for."
Surprise 5. On average, human genes synthesize 3.1 proteins per gene using a method called "alternative splicing" (by rearranging or omitting exons after the introns have been spliced out). This is much more than the [1.0 - 1.5] proteins per gene typically synthesized by more primitive organisms like the nematode worm C. elegans. So our proteomes (the full catalog of different proteins in the organism) are significantly larger than those of worms after all.
Surprise 6. The individual protein architecture is much more complex in human proteins. Speaking metaphorically, a bacterial splicing enzyme could be thought of as a simple "pair of scissors," while a corresponding human enzyme could be thought of as a blender ("Cuisinart") with lots of dials and variable speed. This happened through evolution by a process called "domain accretion" in which the ends of the genes had new DNA added over time.
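To see what the average of 3.1 proteins per gene implies, one can multiply it by the gene counts quoted under Surprise 1. This is only a back-of-the-envelope sketch; the function name is an illustrative assumption, and the products make no attempt to reconcile with the larger proteome total quoted under Surprise 14.

```python
# Figures quoted in this review.
PROTEINS_PER_HUMAN_GENE = 3.1          # average via alternative splicing
PROTEINS_PER_WORM_GENE = (1.0, 1.5)    # range quoted for C. elegans
CONFIRMED_GENES = 26_588               # from Surprise 1
UPPER_BOUND_GENES = 26_588 + 12_731    # confirmed plus candidate genes


def estimated_proteins(gene_count: int, proteins_per_gene: float) -> float:
    """Distinct proteins implied by a gene count and an average splicing yield."""
    return gene_count * proteins_per_gene


low_estimate = estimated_proteins(CONFIRMED_GENES, PROTEINS_PER_HUMAN_GENE)
high_estimate = estimated_proteins(UPPER_BOUND_GENES, PROTEINS_PER_HUMAN_GENE)

# How many times more proteins per gene humans make than the worm does.
yield_ratio = tuple(PROTEINS_PER_HUMAN_GENE / w for w in PROTEINS_PER_WORM_GENE)

print(f"Implied human proteins: {low_estimate:,.0f} to {high_estimate:,.0f}")
print(f"Human/worm proteins per gene: {yield_ratio[1]:.1f}x to {yield_ratio[0]:.1f}x")
```

On these figures, a human gene yields roughly two to three times as many proteins as a worm gene.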
Surprise 7. On at least two occasions bacterial gene cartridges moved over horizontally as a block into human chromosomes (~200 genes in one fell swoop).
Surprise 8. Some genes have been duplicated into other chromosomes several times (sometimes the splicing was inverted as right-to-left instead of left-to-right). There is at least one case in which a subregion of a gene was respliced in an inverted fashion so that the reading frame of the interior portion was correct, while surrounded by two out-of-order reading frames (an inversion within the inversion).
Surprise 9. Junk DNA extends the genome backward in time by 800 million years in the fossil record, so to speak.
Surprise 10. Junk DNA can be important (or actually useful). For example, some "Alu" repeats ([200 - 300] bp long, repeated over one million times) can be found near genes that are well conserved over a long period of time. Therefore, the repeats must be useful, for example in regulating the timing of key events.
Surprise 11. Male mutation rates are double female rates. Men can take the credit for species innovation but they also must take the blame for the propagation of many inherited diseases.
Surprise 12. Only 0.1 percent of genes contain single-nucleotide polymorphisms (SNPs), that is, single-base substitutions. For a fixed species mutation rate per millennium, this would suggest that there was a very small founder pool of ~100,000 individuals who migrated throughout the world from a single place (somewhere in Africa) fairly recently (~145,000 years ago). The SNP variation is much greater in chimps (x2), monkeys, and other great apes (x4), consistent with their being much older species.
Corollary 12a. Racial variation among human populations (Blacks, Orientals, Caucasians) is contained within the SNP variation, which means that racial variation is not very significant at the genomic level.
Corollary 12b. Forensic identification of a suspected criminal based on DNA left at the scene of a crime (by sequencing short tandem repeats), however, can still be done to an accuracy of 1 in 240 billion individuals.
Surprise 13. There are 316 aging genes in C. elegans. [Ref. Drs. Tom Johnson of the University of Colorado in Boulder and Stuart Kim of the Stanford University Medical Center].
Surprise 14. The complete human proteome exceeds 250,000 proteins. We still have our work cut out for us.
III. The Next Grand Challenge for Biology and Medicine
To get some perspective on the scope of this accomplishment, we need to ask how this fits in with prior developments in biology over the last five years and what will be attempted in the next five years. Besides the completion of the sequence and assembly of the human genome, some 40 different genomes have been completed to date (including various viruses, bacteria, yeast, microscopic worms, plants, fruit flies, and mice). About 60 more genomes are currently underway and will be completed before the end of 2001 (the leprosy bacterial genome was just announced yesterday). The next major mammalian genome to be attempted will probably be chimps. It is likely that domesticated animals (horses, cows, pigs, sheep, goats, etc.) will be done after that, for the obvious economic reasons.
A representative bird, such as the chicken (also chosen for its economic value), needs to be sequenced as well. Obviously, bats are important, since, as flying mammals, they make up one of the largest groups of mammals on the planet. In particular, gerontologists would like to know why they live on average more than four times longer than rats -- a great feat indeed. The same question applies to parrots. According to the latest data, the number of human centenarians appears to be increasing on a per capita basis, but one would need a millennium to figure out whether this was merely a statistical anomaly. Furthermore, if it were true, this effect would need to be distinguished from an evolutionary tendency for lifespan to increase for all vertebrates over the last few thousand years, based on the fossil record.
The dollar cost of sequencing a new species is still significant. The original estimate to sequence the human genome back in 1995 was on the order of $3 billion (one dollar per base pair). The current cost for a single pass (1x) of a human-like genome would be on the order of $[10 - 12] million; therefore, $30 million for the human genome project would not be far from wrong. However, it is estimated that in just four years, with improvements in productivity, the cost will fall to less than $1 million for a 1x pass.
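The cost figures above are easier to compare on a per-base-pair basis. The sketch below uses only the dollar amounts quoted in this review and the 3.1 billion base-pair genome size from Surprise 1; the names are illustrative assumptions, and reading the $30 million figure as "a few 1x passes" is my own interpretation of the arithmetic.

```python
GENOME_BASE_PAIRS = 3_100_000_000          # 3.1 billion bp (Surprise 1)
ORIGINAL_ESTIMATE_USD = 3_000_000_000      # original ~$3 billion estimate
CURRENT_1X_USD = (10_000_000, 12_000_000)  # quoted range for a single pass today
PROJECT_TOTAL_USD = 30_000_000             # quoted figure for the whole project
PROJECTED_1X_USD = 1_000_000               # quoted ceiling four years from now


def cost_per_base_pair(total_usd: float, base_pairs: int = GENOME_BASE_PAIRS) -> float:
    """Dollars spent per base pair for a given total cost."""
    return total_usd / base_pairs


original_per_bp = cost_per_base_pair(ORIGINAL_ESTIMATE_USD)
current_per_bp = tuple(cost_per_base_pair(c) for c in CURRENT_1X_USD)
projected_per_bp = cost_per_base_pair(PROJECTED_1X_USD)

# Number of 1x passes that $30 million would buy at today's quoted range.
implied_passes = tuple(PROJECT_TOTAL_USD / c for c in CURRENT_1X_USD)

# Factor by which the quoted project cost undercuts the original estimate.
drop_factor = ORIGINAL_ESTIMATE_USD / PROJECT_TOTAL_USD

print(f"Original estimate: ${original_per_bp:.2f} per bp")
print(f"Current 1x pass: ${current_per_bp[0]:.4f} to ${current_per_bp[1]:.4f} per bp")
print(f"Projected 1x pass: under ${projected_per_bp:.5f} per bp")
print(f"$30 million buys {implied_passes[1]:.1f} to {implied_passes[0]:.1f} passes")
print(f"Original estimate / $30 million = {drop_factor:.0f}x")
```

On these numbers, the original estimate works out to just under one dollar per base pair, and $30 million corresponds to roughly two and a half to three single passes at the current quoted rate.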
One of the major components of the final cost of the effort is not the sequencing itself but the assembly of fragments into a continuous linear sequence located on a particular chromosome. Celera used a 3 TF (Tera Flop [trillion floating point operations per second]) parallel-architecture Compaq computer system (using a design based on Digital's Alpha chips) for this assembly. The machine is rated as the fastest non-military supercomputer in the world today. The public project assembly was accomplished at the University of California in Santa Cruz using a set of Intel PC parallel-architecture machines. In about three years, a 1 PF (Peta Flop [= 1,000 Tera Flops]) machine will be available. The new supercomputer will result from a collaboration between Compaq and Sandia National Laboratory (which currently makes the fastest military supercomputers used by the DOE to simulate nuclear weapons explosions) and is part of the basis for estimating a significant reduction in the final cost. IBM is in the process of developing a similar high-performance machine (Blue Gene) to deal with the so-called "protein folding" problem.
It should also be noted that in comparing the two different data bases (Celera and Public), the main criteria for distinguishing them are the quality of the annotation associated with each gene and the tools available for homology search (such as the BLAST search engine) across different species. The graphical user interface (genome browser) is also important for user friendliness. With respect to these criteria, the Celera data base is considered to be slightly better than the public data base. However, data bases from other commercial vendors, all of whom have access to the public data on GenBank, may be even better than either of these, such as the Prophecy Data Base from DoubleTwist, Inc. of Oakland, CA [7]. Also, other proprietary data bases, such as those from Incyte Genomics of Palo Alto, CA or Human Genome Sciences of Rockville, MD, may have particular advantages, since these proprietary data bases are derived from cDNA (and mRNA) rather than from raw sequence, where the punctuation marks (start and stop codons) are not always crisply defined.
The next "grand challenge" for biologists will be to understand the collection of proteins that are created by these [30 - 40] thousand genes -- their 3-dimensional shape (conformation), their function, their relationship to one another (quaternary structure), and their participation in linear pathways or weblike networks of interactions within the nucleus, in the cytoplasm, or in the blood stream. This will be important for the process of "drug discovery." But it will also be important for stem-cell technology. For example, understanding exactly where in the embryogenic program individual genes become active or subsequently become shut down will be important for finding out how stem cells form tissues, maintain the shape of organs, or heal wounds. How adult salamanders and axolotls can lose and grow new tails will be of great interest, once, of course, we understand why giraffes have long necks or elephants have long trunks.
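To give a feel for what "assembly of fragments into a continuous linear sequence" means, here is a toy sketch that repeatedly merges the two fragments with the longest suffix-to-prefix overlap. It is a textbook greedy heuristic on made-up fragments, offered only as an illustration; it is not the algorithm used by Celera or by the public project, and it ignores sequencing errors, repeats, and reverse-complement strands, which are what make real assembly so computationally expensive. All names in it are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are treated as chance matches and ignored.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Merge fragments pairwise, always taking the largest overlap first.

    Returns the list of contigs left when no remaining pair overlaps.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Made-up reads covering a short made-up sequence.
reads = ["GCGTACC", "ATGGCGT", "TACCTGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

The pairwise comparison inside the loop is the expensive part: every fragment is compared with every other fragment on each round, which hints at why assembling millions of real fragments calls for supercomputer-class hardware.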
The implications for drug discovery and for new treatments for immune incompetence, heart disease, stroke, cancer, and diabetes are obvious. Understanding "The Book of Life" will make the medicine of the 21st century qualitatively different from anything that has gone before. This weekend represented an important advance for the biological sciences. Such progress in our own lifetime is so significant that it permits us to seriously ask still more questions -- ones that we dared not ask before for fear of ridicule by our more conservative colleagues. May we continue to live in interesting times.
1. J. Craig Venter, Mark D. Adams, et al. [with 275 co-authors], "The Sequence of the Human Genome," Science, Vol. 291, No. 5507, pp. 1304-1351 (February 16, 2001). http://www.aaas.org.
2. Francis Collins and the International Human Genome Sequencing Consortium, "Initial Sequencing and Analysis of the Human Genome," Nature, Vol. 409, pp. 860-921 (February 15, 2001). http://www.nature.com.
3. DigiScript, Inc., 113 Seaboard Lane, Suite C-270; Franklin, TN 37067; Voice: 615-778-0780; http://www.digiscript.com.
4. Francis Collins, Editor, Educational Video of the Human Genome Project (TRT = ~10 minutes), http://www.nhgri.nih.gov/educationkit.
5. Natasha V. Raikhel, Editor-in-Chief, "Arabidopsis Genome: A Milestone in Plant Biology," Plant Physiology, Special Issue, Vol. 124, No. 4, pp. 1-1865 (December 2000).
6. Mark S. Frankel and Audrey R. Chapman, "Human Inheritable Genetic Modifications: Assessing Scientific, Ethical, Religious, and Policy Issues," (September 2000) http://www.aaas.org/spp/dspp/sfrl/germline/main.html.
7. DoubleTwist, Inc. 2001 Broadway; Oakland, CA 94612; Voice: 510-628-0100; http://www.doubletwist.com (Internet access by means of a standard browser requires a subscription fee of $7,000 for ~500 annotated sequences).
8. Leonard Hayflick, University of California at San Francisco, "Biological Limits of Human Longevity," p. A39-40, Proc. of the AAAS 2001 Annual Meeting and Science Innovation Exposition (San Francisco, California; February 18, 2001).
9. J. Netting and L. Wang, "The Newly Sequenced Genome Bares All," Science News, Vol. 159, No. 7, pp. 100-101 (February 17, 2001).
10. "On Human Nature: Rival Versions of the Human Genome Have Been Published at Last. Despite Arguments between the Teams that Produced Them, the Results Are a Huge Step Towards a Proper Understanding of How Humans Work," The Economist, pp. 79-81 (February 17, 2001).