My View

Richard Gayle

Lies and Statistics November 13, 2000

They always seem to have the flu shots around here 1 week too late for me. Last week was not fun so I will apologize if this column is not as coherent as it should be. The lingering effects of too little sleep.

I went looking on the Internet for a good quote on statistics a few weeks ago. I used Google.com, my favorite search engine. To my delight, there was a great quote from someone I knew personally: "Ludwig Boltzmann, who spent much of his life studying statistical mechanics, died in 1906, by his own hand. Paul Ehrenfest, carrying on the same work, died similarly in 1933. Now it is our turn to study statistical mechanics." (David L. Goodstein) David Goodstein was one of the professors who taught me physics at CalTech. I knew I wanted to work in biology so I never tried to take the A track class taught by Feynman. I took the B track class.

Goodstein provided one of the most valuable lessons on research I ever learned. His wife was the university archivist and had access to old lab notebooks from years ago. Robert Millikan, the first president of CalTech and one of its most influential scientists, was a Nobel Prize winner. He performed a famous experiment called the oil drop experiment that helped determine the value of the electric charge of an electron.

Now, this experiment involved making lots of real-time observations. His lab notebook is loaded with tables of these observations. Next to many of them are notations such as "Data no good", "Data too high", etc. But next to one set was something like "Perfect. Publish." For the first time I realized that scientists do not just observe the world. The real world is often too complex to do this. A scientist is often observing a simplified world in an experiment and must be able to separate wheat from chaff, signal from noise. They have to provide some interpretation of their work. (The Internet is great since I found a nice discussion of this very topic.)

In fact, many people were trying to do the same thing that Millikan did. He succeeded because he understood many of the technical problems that produced aberrant data points. Including these data, as some of his contemporaries did, would make the determination of the correct value impossible. There was really nothing for the casual observer to discriminate between Millikan's good data and the bad. He had the experience to do so and did. In fact, even including much of the data he discarded did not affect the value he calculated, only the variance around it. Better understanding of statistics might have allowed him to use more data, but it would not have changed the outcome of his results.

Of course the other lesson I learned is that sometimes the difference between fraud and truth is slight. Others have felt that Millikan committed fraud. Yet, he was able to choose the data because he had a firm handle on all the variables and how they affected the data. By controlling those, he eventually produced data that got to the truth. Observational bias can be a constant problem. Millikan was a genius. He could choose his data because he was right and knew why.

Another famous researcher has been suspected of cooking the books. However, this is another case of people not understanding what is really being measured and what the data are really telling us. Gregor Mendel published his data on the sorting of phenotypes between peas as a relatively informal presentation. In fact, all of his work consists of 2 papers and several letters. If he had known how much his work would be scrutinized over the next 150 years, I am sure he would have been more careful with his choice of words.

Now, Mendel published his work on peas. He chose peas because he had created a list of observable traits in these plants that could easily be measured. We have all seen the tall vs. short, wrinkled vs. round diagrams in textbooks. He narrowed the list down to 7 traits. Now guess how many chromosomes the pea has? Fourteen. Its haploid number is seven. Seven traits and seven chromosomes. Hum! And they are all on different chromosomes. Double hum! He just happened to pick 7 traits that are on 7 different chromosomes. Well, he was not just observing nature and his choice of traits was not random. He created a simplified model.

Mendel was interested in hybridizations. What happened when you took a plant or animal that was "pure" in one trait and crossed it with one bearing another "pure" trait? What happened when two forms of one trait were crossed? If he had chosen 2 traits that were on the same chromosome, their segregation would have been linked. There was no real concept of the chromosome as a genetic storehouse, of how genes close on one chromosome will stay together in subsequent generations. Mendel simplified his study, only looking at traits that always bred true but never together. He did not just observe generations of plants and enumerate their properties. He chose properties explicitly that would demonstrate what he needed to know. One could call this a real case of observational bias. Except that his simplification allowed him to observe things that are absolutely true and were against the accepted theories of the time. He had no idea that his traits were on different chromosomes. He just wanted traits that stayed separate from each other.

The dogma at the time was that traits blended. If you took a tall plant and crossed it with a short one, you'd get a medium-sized one. Darwin, in 'On the Origin of Species', believed this was the case. It was the main basis of many of the miscegenation laws passed in the late 1800s (Some are still around. Alabama has one as part of its state constitution. They just voted on whether to remove it. They will.). People could see that the first generation was midway in color.

Mendel's great breakthrough was that he showed that you might only get 1 type in the first generation (if there was a dominant trait), but that both types would be present in the 2nd generation. In fact if you crossed hybrids, you would get, on average, one of the dominant, one of the recessive and two hybrids. This explained why you could have a plant (an F1 hybrid) that appeared to have the same dominant trait as a parent but that would generate offspring that showed either dominant or recessive traits. There was not an amalgamation.
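For anyone who wants to check the bookkeeping, here is a minimal Python sketch of a one-gene cross. It is only an illustration, not anything Mendel wrote down: the allele labels 'A' (dominant) and 'a' (recessive) and the function names are my own assumptions. It simply lists the four equally likely ways the parental alleles can combine.

```python
from collections import Counter
from itertools import product

def cross(parent1, parent2):
    """Enumerate the equally likely offspring genotypes of a one-gene cross.

    Each parent is a two-letter string such as "Aa" (hypothetical labels:
    'A' is the dominant allele, 'a' the recessive one).
    """
    offspring = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Sort so that "aA" and "Aa" are counted as the same hybrid genotype.
        genotype = "".join(sorted(allele1 + allele2))
        offspring[genotype] += 1
    return offspring

def phenotype_counts(genotypes):
    """Collapse genotypes into what an observer would actually see."""
    counts = Counter()
    for genotype, n in genotypes.items():
        counts["dominant" if "A" in genotype else "recessive"] += n
    return counts

f1 = cross("AA", "aa")   # pure dominant x pure recessive
f2 = cross("Aa", "Aa")   # hybrid x hybrid

print("F1 genotypes:", dict(f1))
print("F2 genotypes:", dict(f2))
print("F2 phenotypes:", dict(phenotype_counts(f2)))
```

The first generation comes out all hybrid, while the hybrid cross gives one pure dominant, two hybrids and one pure recessive, which an observer scores as three dominant-looking plants for every recessive one.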

Now, not every crossing gave you the standard ratios. The ratio of dominant to recessive forms in the 2nd generation will approach 3:1 but individual crosses will vary from these numbers. So, if you analyze Mendel's raw data, some interesting things can be observed. His numbers are too good! At least that is the premise of some authors. By one measure, the chance of Mendel getting so close a correspondence with the expected results is 0.00007%. Statisticians will say that because the odds are so low, he must have fudged the data. Real scientists know that observational bias is a likely reason, not outright fraud. Mendel did not do any blinding in his calls of phenotypes. How wrinkled did a pea need to be? What height made it a tall plant?
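To get a feel for how much honest data should wander around 3:1, here is a small simulation sketch. It assumes an idealized hybrid cross in which each offspring draws one allele from each parent at random. The sample sizes and the random seed are arbitrary choices of mine, not Mendel's numbers.

```python
import random

def simulate_f2(n_plants, rng):
    """Count dominant-looking plants among n_plants offspring of Aa x Aa."""
    dominant = 0
    for _ in range(n_plants):
        # Each parent passes on 'A' or 'a' with equal probability.
        alleles = (rng.choice("Aa"), rng.choice("Aa"))
        if "A" in alleles:
            dominant += 1
    return dominant

rng = random.Random(1865)  # arbitrary seed so the run is repeatable

# Fraction of dominant plants in five small and five large experiments.
small = [simulate_f2(20, rng) / 20 for _ in range(5)]
large = [simulate_f2(20000, rng) / 20000 for _ in range(5)]

print("20 plants each:    ", small)
print("20000 plants each: ", [round(x, 4) for x in large])
```

Small experiments scatter widely around 0.75 while large ones hug it. The argument against Mendel is that his reported counts sit closer to the ideal than this kind of scatter would normally allow.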

It would not take too much observation to see what the ratios would be in a perfect world. And, besides, Mendel was not really all that interested in the why of the numbers. There was no real understanding of how all this would work. He was interested in establishing basic rules of heredity that became known as Mendel's principles. These were that of segregation (that traits could be found in pairs, one of each pair being transferred from each parent to the offspring) and that of independent assortment (that several traits could be passed on to offspring separately and independently). That these two fundamental principles of heredity could only have been determined by choosing 7 traits on 7 different chromosomes is what makes Mendel's work, at least to my mind, one of the greatest papers in biology. With no real rationale for his choices, he was able to create a simplified model that opened the way to actually seeing how things worked.

Recently, a nice editorial on fraud and lies in science, written by David Goodstein, who is now Provost at CalTech, was published in The Scientist.

Both Millikan and Mendel were trying to determine actual values, concrete concepts that, ideally, have a single result. There was a real number at the end. It gets much more difficult when we examine complex biological systems, such as gene expression or cell differentiation. It gets easier to fool ourselves. And, as our tools get better, things often get more complex, at least initially. However, enough of this has happened in other sciences, especially physics, to help give us some hints of how to proceed.

When Millikan started his work, it was 'known' that the atom was composed of electrons, protons and neutrons. The electrons orbited the nucleus much like planets did in our solar system. The proton, neutron and electron were real things, whose direct effects could be determined, like Millikan's oil drop experiment. The expectation was that, as tools improved, it would be possible to directly image these particles.

Of course, as the tools were developed, the complete opposite was seen. You could not actually 'see' an electron. You could simply determine the probability of its location. Same with anything at those small distances. This is the cause of a lot of the non-intuitive aspects of quantum physics. It is only when we look at large numbers of particles that the probabilities regarding location result in a 'real' substance. Because, when your sample is large enough, you can get mathematical certainty, even if you can not predict what a specific molecule will do. Just as you know that if you toss a fair coin a million times, very nearly 50% of the tosses will be heads, even though you can not know what an individual toss will produce. (As an aside, I would make the case that the Presidential election was essentially a random event, with equal likelihood that people would choose Gore or Bush. A truly 50/50 probability would give us results pretty near what we have seen. So, while individuals might have been able to 'predict' what they would do, as a group the choice was random. We might as well have tossed coins to decide. Think I can get on Larry King Live with this theory? ;-) )

Now, why am I talking so much about probability and statistics? Well, it is becoming increasingly obvious that many biological processes involve a probabilistic mechanism, not a deterministic one. That is, you can not say 'If X happens, Y will happen.' You can only say 'If X happens, there is a Z probability that Y will happen.' Depending on conditions, Z can vary tremendously. These sorts of processes are called stochastic processes. These incorporate a probability function over a period of time.

So, what sorts of biological stochastic processes am I talking about? Well, one that will be of interest to us involves hematopoietic stem cells (HSC). An HSC can divide and go down a differentiation pathway or it can replicate and stay an HSC. How does it know what to do and what molecules drive it down one pathway or another?

Well, it is looking like this process may be stochastic. An HSC may have a probability of existing in 1 of 2 states, one leading to differentiation and one leading to self-renewal. The probabilities may differ under different circumstances. You could take 2 HSC that are initially identical, put them in the same environment and, due to stochastic processes, one would self-renew and the other might differentiate.

This is still controversial but what it means is that you can only deal with statements like 30% of the HSC will self-renew and 70% will differentiate under these conditions. You can not say that a specific HSC will go along a particular pathway. This is somewhat nonintuitive. Most of us look at these sorts of things as deterministic. If we place identical cells in identical surroundings, they will act identically. And it should be repeatable. If we do it again, we should get the same result. That may not be true.
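Here is a toy Python sketch of that idea. It assumes each cell makes one independent random choice, using the 30%/70% split from the example above. The function name and the seed are mine, and real HSC behavior is surely messier than a single weighted coin toss.

```python
import random

P_SELF_RENEW = 0.30  # illustrative value from the 30%/70% example

def choose_fate(rng, p_self_renew=P_SELF_RENEW):
    """One identical cell, one identical environment, one random outcome."""
    return "self-renew" if rng.random() < p_self_renew else "differentiate"

rng = random.Random(42)  # arbitrary seed so the run is repeatable

# Two initially identical cells need not do the same thing.
two_cells = [choose_fate(rng) for _ in range(2)]

# A large population, though, is very predictable as a group.
population = [choose_fate(rng) for _ in range(100_000)]
fraction = population.count("self-renew") / len(population)

print("Two identical cells:", two_cells)
print("Fraction self-renewing in the population:", round(fraction, 3))
```

Nothing in the code distinguishes one cell from another, yet individual outcomes differ. Only the population fraction is a stable, repeatable number.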

There have been a slew of papers in the last year examining this idea. Some great titles that are not online are: 'Do Stem Cells Play Dice?', 'To Be or Not Be Active', 'Nature versus Nurture in T cell cytokine Production.' Now, no one is saying that all processes are stochastic, driven by some non-zero probability. Many things are probably deterministic but there appears to be a substantial amount that are driven by randomness. A receptor gene just might turn on for a short period of time because it has a low probability to express even in repressed circumstances. This receptor on the cell surface now binds a cytokine which results in a signaling cascade that drives this cell down some pathway. Another cell, that initially was the same but for this one random event, does not follow this path.

OK, enough rambling, let's talk papers. Two recent articles in Blood examine this problem. One examines the kinetics of HSC repopulation in mice. It was written by a group at the UW. They took lethally irradiated mice and transplanted in marrow cells from 2 different donors: 4 x 10^5 Gpi-1b cells and 0.25 x 10^5 Gpi-1a cells. They then examined the proportion of Gpi-1a granulocytes, platelets, T cells and B cells at 6, 15 and 30 weeks post-transplantation.

So, obviously any Gpi-1a cells present in the mice at 30 weeks must have come from some stem cells present in the original transplantation. What this group was able to do was use a procedure called stochastic modeling to work backwards. They could use the data they generated to determine some basic facts about murine stem cells, even though they never isolated any. Examining the properties of isolated stem cells is sometimes problematic. First, are these cells actually HSC and not some other population? Further, what effects have in vitro manipulations had on their fundamental properties? By using this approach, they can identify properties of HSC without ever having to actually see one. Isn't math great?

They made a very simple model. An HSC can replicate or self-renew with a certain rate (l). It can differentiate (n) or it can die (a). Once it differentiates, a clone can contribute to hematopoiesis until it is exhausted (m). Their model assumes that HSC act independently from one another, that everything that happens to them is stochastic and that all clones that contribute to hematopoiesis contribute equally. Since the fate of the HSC can be expressed as a probability function, statistical methods can be used to simulate the experimental data, in order to determine which values of l, n, a, m best fit the data. The model can also identify the number of HSC present in the initial transplantation, R0. Finally, the parameters could be checked against the repopulation of B and T cells.
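To make this concrete, here is a rough Python sketch of how one might simulate a model of this shape. It is not the authors' code. I am assuming a standard event-by-event (Gillespie-style) simulation, and the rate values, the starting number of HSC and the names lam, nu, alpha and mu (standing in for l, n, a and m) are placeholders I picked for illustration.

```python
import random

def simulate_hsc(r0, lam, nu, alpha, mu, t_end, rng):
    """Simulate one transplant; returns a list of (time, n_hsc, n_clones).

    r0    : HSC present at transplantation
    lam   : replication rate per HSC (stands in for l)
    nu    : differentiation rate per HSC (stands in for n)
    alpha : death rate per HSC (stands in for a)
    mu    : exhaustion rate per contributing clone (stands in for m)
    Rates are per week here, and every cell acts independently.
    """
    t, hsc, clones = 0.0, r0, 0
    history = [(t, hsc, clones)]
    while True:
        rates = [lam * hsc, nu * hsc, alpha * hsc, mu * clones]
        total = sum(rates)
        if total == 0:          # nothing left that can happen
            break
        t += rng.expovariate(total)   # random waiting time to the next event
        if t > t_end:
            break
        event = rng.choices(range(4), weights=rates)[0]
        if event == 0:          # an HSC replicates
            hsc += 1
        elif event == 1:        # an HSC differentiates into a contributing clone
            hsc -= 1
            clones += 1
        elif event == 2:        # an HSC dies
            hsc -= 1
        else:                   # a contributing clone is exhausted
            clones -= 1
        history.append((t, hsc, clones))
    return history

# Placeholder parameters, chosen only to make the sketch run.
history = simulate_hsc(r0=10, lam=0.4, nu=0.29, alpha=0.05, mu=0.1,
                       t_end=30.0, rng=random.Random(3))
print("Events simulated:", len(history) - 1)
print("Final state (week, HSC, clones):", history[-1])
```

Fitting would then mean running many such simulations for different parameter values and keeping the values whose simulated outcomes look most like the observed mice.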

So, they may not be able to isolate an HSC but they can tell what many of its properties are by the effects it has on downstream processes. One of the interesting things they are trying to determine is what the effect of body size is on the properties of HSC. A 25 g mouse makes as many red blood cells in its entire life as a 70 kg human does in 1 day. So, do humans have more HSC or do they divide more often? The authors had previously performed a similar set of experiments on cats, so they had feline numbers to compare.

To make a long story short, they saw large differences in most of the values between cats and mice. The simulated data indicate that mice have somewhere between 4 and 22 HSC per 10^5 nucleated marrow cells, with 8 per 10^5 being the best fit. This compares with the feline best fit of 6 in 10^7 cells. So, HSC are much more plentiful in mice than in cats. Also, they replicate more often (1 per 2.5 wks vs. 1 per 8-10 wks), and differentiate more often (1 per 3.4 wks vs. 1 per 12.5 wks.)

These values fall pretty close to several, more classically determined, values. So, using a model based on stochastic processes yields results that fit observations. And it can do more. The model suggests that in larger animals, HSC are less frequent and replicate more slowly. Therefore, their proliferative potential in large animals has to be much greater than in small animals. That is, each HSC will produce more differentiated clones in its lifetime in larger animals. So, either there are very basic differences in the biology of HSC in mice and cats (and presumably humans) or murine HSC have an excess capacity to support hematopoiesis that is not normally used.

Another interesting aspect of this model is that if m is equal to 0, that is if clones never become exhausted, there are no values for any of the other parameters that can mimic the observed results. This would offer support for the idea of clonal succession, where hematopoiesis is maintained by several clones of HSC, each acting sequentially, with a definite lifetime, rather than the idea that a few clonal cells are active forever.

The other paper in Blood is more of a review but takes the stochastic nature of biology and looks for reasons. Most of us view transcription rates for a particular gene like the volume on an amplifier. If more gene product is needed, you crank up the dial, moving through 2, 5, 10, going all the way to 11 if needed (for those of you who are Spinal Tap fans). More and more analysis of gene expression in single cells indicates that this may not be true in all cases. Instead, gene expression may be binary, either on or off. There is no in-between. What controls the amount of transcript is the probability that transcription will occur, not the strength of that particular promoter. Another stochastic process.

Here is one way to look at it. In the analog model, each of 10 cells will have 1 transcript. Crank up expression, and each cell has 10 transcripts. In the digital model, 1 of the 10 cells will have 10 transcripts and the other 9 will have none. Crank up expression, and each cell has 10 transcripts. So the overall number of transcripts will not be different and if you examine a population of cells you can not see any difference. But if you examine individual cells, then the binary nature of many genes starts to appear.
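A quick Python sketch shows why a bulk measurement cannot tell the two models apart. The numbers follow the 1-versus-10 transcript example above. The population size, the seed and the function names are assumptions of mine.

```python
import random
from statistics import mean

def analog_population(n_cells, level):
    """Analog model: every cell carries the same number of transcripts."""
    return [level] * n_cells

def digital_population(n_cells, p_on, burst, rng):
    """Digital model: each cell is independently fully on or fully off."""
    return [burst if rng.random() < p_on else 0 for _ in range(n_cells)]

rng = random.Random(11)  # arbitrary seed so the run is repeatable

analog = analog_population(10_000, level=1)
digital = digital_population(10_000, p_on=0.1, burst=10, rng=rng)

# A bulk measurement only sees the average over all cells.
print("Mean transcripts per cell, analog: ", mean(analog))
print("Mean transcripts per cell, digital:", round(mean(digital), 2))

# A single-cell measurement sees the distribution underneath the average.
print("Distinct levels seen, analog: ", sorted(set(analog)))
print("Distinct levels seen, digital:", sorted(set(digital)))
```

The averages come out nearly the same, but the single-cell picture is completely different: one population is uniform and the other is a mix of silent cells and fully expressing ones.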

I'm not going to go into all the evidence this week about the stochastic nature of gene expression. It does explain a whole slew of observations, such as mosaic expression, monoallelic expression, haploinsufficiency. What the review explores is how cells adjust the probability of transcription so that they can react to circumstances. It has some interesting consequences for a stochastic view of life.

A binary approach to transcription leaves little room for control, unless there are processes that will change the probability of transcription of a gene. The elements responsible for this control may be enhancers. Enhancers are cis-acting elements in the DNA that appear to increase the probability of gene transcription. The more enhancers that are occupied the greater the probability that the gene will be expressed. So, binding of one enhancer may reflect one set of circumstances, the binding of another may reflect a second set. If each event increases the chance of transcription 5 fold, then, if both are occupied, the chances go up 25-fold. Probabilities multiply, particularly if the two events are independent. Often, if two stimuli act multiplicatively and synergize, it is taken as evidence for a direct linkage between the two, such as protein-protein interactions. But a probabilistic view would say that the synergism is simply a direct consequence of the independent, stochastic nature of the event.
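The arithmetic is simple enough to write out. In this Python sketch the baseline probability of 0.002 is a made-up number and the function name is mine. The 5-fold effects are the ones from the example above, and the only assumption is that the two enhancers act independently.

```python
def transcription_probability(base, fold_changes):
    """Multiply a baseline probability by each independent fold increase.

    The result is capped at 1.0 because it is a probability.
    """
    p = base
    for fold in fold_changes:
        p *= fold
    return min(p, 1.0)

base = 0.002  # hypothetical chance of transcription with no enhancer occupied

one_enhancer = transcription_probability(base, [5])
both_enhancers = transcription_probability(base, [5, 5])

print("Fold increase, one enhancer occupied: ", one_enhancer / base)
print("Fold increase, both enhancers occupied:", both_enhancers / base)
```

No interaction between the two enhancers is coded anywhere, yet the combined effect looks synergistic. That is the point of the probabilistic reading.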

So, what becomes important for the life of a cell is how often a pulse of mRNA comes along, how stable the mRNA is and how stable the protein is. If the mRNA and protein are very stable, then the pulses can be well separated. There would only need to be a small amount of new mRNA made to keep steady-state levels stable. But what if the mRNA half-life is much shorter than the time between mRNA pulses? Then the level of mRNA and protein seen in individual cells will simply reflect the time since the last pulse and not some intrinsic difference between cells. Subpopulations based on gene expression may, in some cases, be nothing more than the results of stochastic processes.
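Here is a deliberately simplified Python sketch of that last point. It assumes pulses arrive at random with a fixed average spacing, that mRNA decays exponentially, and that only the most recent pulse matters (so it ignores leftovers from earlier pulses). The burst size, the 10-hour spacing and both half-lives are hypothetical numbers of mine.

```python
import random

def mrna_remaining(burst_size, time_since_pulse, half_life):
    """Transcripts left from a single pulse after simple exponential decay."""
    return burst_size * 0.5 ** (time_since_pulse / half_life)

rng = random.Random(7)   # arbitrary seed so the run is repeatable
MEAN_GAP = 10.0          # hypothetical average hours between pulses
BURST = 100              # hypothetical transcripts made per pulse

# For randomly timed pulses, the time since the last pulse in each of 1000
# otherwise identical cells is itself a random (exponential) quantity.
times = [rng.expovariate(1.0 / MEAN_GAP) for _ in range(1000)]

short_lived = [mrna_remaining(BURST, t, half_life=1.0) for t in times]
long_lived = [mrna_remaining(BURST, t, half_life=100.0) for t in times]

print("Short half-life, lowest and highest cell:",
      round(min(short_lived), 2), round(max(short_lived), 2))
print("Long half-life, lowest and highest cell: ",
      round(min(long_lived), 2), round(max(long_lived), 2))
```

With a short half-life the cells spread out across nearly the whole range of possible levels, even though every cell follows exactly the same rules. With a long half-life they look far more alike.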

See, probability can not tell us what an individual event will be. A stochastic approach allows us to determine the probability of what a group of cells might do, but it can not tell us what individual cells will do or look like. Stem cell differentiation may simply proceed from the random expression of a set of genes. There may only be a 5% chance that any cell will express this set, but if there are several hundred cells, then one cell gets to be the lucky dog. Just like in the Lottery.
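For the lottery-minded, the odds are easy to compute. This Python sketch assumes each cell independently has the same 5% chance, and takes 300 as a stand-in for "several hundred" cells. Both the independence and the 300 are my assumptions.

```python
def p_at_least_one(p_single, n_cells):
    """Chance that at least one of n independent cells expresses the set."""
    return 1.0 - (1.0 - p_single) ** n_cells

def expected_count(p_single, n_cells):
    """Average number of expressing cells across many repeats."""
    return p_single * n_cells

P_SINGLE = 0.05   # the 5% chance from the text
N_CELLS = 300     # hypothetical stand-in for "several hundred"

print("Chance that at least one cell expresses the set:",
      p_at_least_one(P_SINGLE, N_CELLS))
print("Expected number of expressing cells:",
      expected_count(P_SINGLE, N_CELLS))
```

So a winner is all but guaranteed at the level of the group. What the calculation cannot say is which cell it will be.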