Lecture: Statistics in Medicine Part II: Interpreting Probability Values

During this live lecture we discuss the following:

  • Concepts of p values, confidence intervals, statistical and clinical significance
  • Examples to demonstrate interplay of these factors
  • Journal examples to define approach to interpreting clinically useful results

Lecturer: Dr. Vivek Dave MD, DNB, FRCS, FICO



DR DAVE: So good afternoon, everybody. I am Dr. Vivek Dave. I am from India. I am a retina surgeon based at the LV Prasad Eye Institute in Hyderabad, in India. I have a keen interest in biomedical statistics, and I am a strong believer that basic biomedical statistical knowledge is of the utmost importance to clinicians all around the world. And for most of our basic research in day-to-day life, basic research related to our clinics, we really require just a baseline knowledge of basic biomedical statistics. We really do not require a lot of advanced statistics which require us to involve a lot of statisticians. So what we’re going to do today is discuss four basic biomedical statistical entities, namely the p-value, confidence intervals, statistical significance and clinical significance, and we’ll see how they apply to each other when we go to the literature, so we can get meaningful results out of the literature which we can apply to our clinic. So over the years it’s always been if you pick up a journal article, you go through the language and you see p-value, confidence intervals, statistical significance. So as a clinician, this is not in our comfort zone. I hope to relate these in a way that arouses interest in the audience in these concepts, and hope that when you go back and read literature, you are able to apply these concepts and understand what they are going through. So as we begin, I will put you through some poll questions and start off with the first poll question, which is demographic in nature. I would like to know what is your position in the audience, in terms of you being an ophthalmologist, an ophthalmologist in training, a nursing staff, an ophthalmic technician, or a medical student. All right. So as I understand, most of our audience comprises ophthalmologists and ophthalmic technicians. So a lot of you are from the field where I practice.
So I assume it will be easier for you to understand the concepts from my point of view as we go through the presentation. So another interesting poll question. I’m sure all of you are seeing a photograph on your screen. So if you could see the photograph and just tell me: What do you see on the screen? Is it a monster in the sky? Is it a bug on the road? Is it a bug in the sky? Or you think it’s none of the above?

DR DAVE: So if you look at this photograph, it’s palpable — it apparently looks like it’s a monster on the screen, out of some movie, but actually what it is — it’s a small bug which is on a wind screen. My purpose of putting this photograph across: This is a lot like statistics. So if you do not understand the concepts clearly, you will probably miss most of the picture and interpret things in a wrong way. So this is the incomplete picture, which gives us a different view, and this is a complete picture, which actually gives us a completely different perspective. So for today, going to p values, confidence intervals, clinical and statistical significance — we’ll start off first with the p-values. So to understand that, I’m sure most of us are aware of these two pain killers, which I have listed. One being paracetamol, and the other being morphine. Now, all of us are aware that paracetamol is a routine nonsteroidal antiinflammatory drug, and it is the basic pain killer available to us and easily available over the counter. Whereas morphine is a narcotic. Now, morphine definitely reduces pain to a much greater extent than paracetamol. But how do we know that? We know that because over the years, studies have been conducted which have clearly shown that morphine — and mark my words, morphine has a greater probability to reduce pain than paracetamol. More than what can be explained just by chance. So basically what this statement says is: Whenever you have two entities which are comparable, and you see a difference between these two entities, there actually can be two situations. One in which the difference between these two entities actually exists, and the second is when the difference between these two entities has just occurred by chance. So what a p-value does is it measures probability. It measures probability of any observed difference that you are seeing to have happened by chance.
So whenever you go through literature, you will see the p-value being mentioned as a fraction. It is zero-point-something. So if there is a p-value which equates to 0.5, it basically tells us that there is a 50% probability of the observed difference happening by chance. That means a 50% chance that whatever you are seeing has actually occurred and a 50% chance that whatever you are seeing has just occurred by a matter of luck. So as you see different p values, you realize that they indicate different probabilities of the difference occurring. Now, in statistics, there is a value called P=0.05. So when you say P=0.05, it means there’s a 5% probability that the observed difference that you are seeing in your study is happening by chance. That means you have conducted a study, you have found a difference between the two groups, and you have found a p-value which is 0.05. So it tells you that whatever difference you have found, there is a 95% probability that that difference actually exists, and only a 5% probability that that difference is occurring by chance. Now, this is a value which in literature is accepted as significant. That means whenever you are conducting a study and you can prove that you have a particular difference, and the p-value for that difference is less than 0.05, it tells you that whatever has occurred has actually occurred, and the probability of it occurring by chance is minuscule. So this is something which is acceptable. Any p-value which is greater than this value is assumed to not have occurred in the real sense, and probably have occurred by chance. Now, this is only arbitrary. It is up to the researcher to decide what p-value he or she wants to put across as significant or non-significant. So as a general principle, the smaller the p-value that you signify, the greater will be the strength of your study, because that is the small error you are accepting in your study. But in general, most studies will have a 5% probability cutoff.
So again, just to understand the concept further, if there is a study which has a p-value of 0.0032 in a given difference, it means 0.0032, or 0.32% probability that the difference has occurred by chance. That means 99.68% chance that the difference has actually existed. That means the smaller the p-value, the better is your result. So this is something as a take-home message that the audience should take. That any p-value which is less than 0.05 is considered as a statistically significant difference. I would stress on the word “statistically significant difference”, because it has a completely different meaning to it from something called a clinical significance, which we will come to further. After having understood the basics of a p-value, let us understand something called the confidence intervals. This is something which is a little difficult to grasp. Hence we’ll try and solve it by an example. So let us say that we are present in a neonatal intensive care unit, and there we have many small infants. Now let us say that from all the infants that are present in that neonatal care, we pick up 30 infants. We pick up those 30 infants and measure all those birth weights. If we measure all those birth weights, add them together, and divide by 30, we will get a mathematical value which is called the mean or the average. So let us say for example the mean or average of birth weight of those 30 selected infants is 2234 grams. Now, these 30 were selected from the entire population of infants that were present in that given hospital. Now, that 30 actually represents the entire lot of infants in that hospital. It is a sample from the entire lot of infants in that hospital. Now that entire lot of the hospital will also have its own mean. Correct? Now, that mean need not be necessarily equal to the mean of these 30 infants, because these 30 infants will have their own value, in terms of the mean, whereas the whole lot will have its own mean. 
So the mean weight of these 30 infants and the mean weight of the entire lot of infants is going to be different. That difference is termed as a small sampling error. That difference occurred because you chose a small sample from the entire lot. So you induced a small error in calculation of the mean. Why you would pick up a small lot is because, practically, in a given study, you cannot include the whole population at risk. So, for example, sitting here in India, if I want to take a gauge of how many diabetics actually have their eyes affected due to diabetes — suppose I want to calculate the prevalence — it is not possible for me to go around and check each and every Indian and check who is diabetic and then check their eyes and whose eyes are affected. It is not possible. Because we are over a billion population. So we take as large a number as we feasibly can take, and examine them for the study question. And then we give our values with something called confidence intervals. Confidence intervals basically help us extrapolate to the population. So we will now see how confidence intervals are calculated. So as I said, whenever we take a small sample from the whole population, and calculate its mean, it will be somewhere close to the mean of the entire population, but will not be equal to that mean. So this difference is what is called as a sampling error. So confidence interval actually is related to this sampling error. This sampling error is called, in statistical terms, as the standard error of the mean. So to calculate the standard error of the mean, what we do is: Just going back to the same example, suppose we have a total of a hundred infants in this given hospital. And we have picked up one lot of 30 infants. So suppose we pick another lot of 30 infants, and then another lot of 30 infants, and then one more lot of 10 infants. So now we have four samples of 30, 30, 30, and 10. Each of them will have their own mean. 
So if we calculate the mean of each of them, and then add all those means and divide by 4, because we had four samples, we will get what is called as a standard error. Mathematically speaking, standard error is calculated by dividing the standard deviation by the root of the total number of people that you have measured. Now, this is the mathematical calculation of standard error. How confidence interval is related to the standard error is that whenever you get a standard error and you multiply it by 2, you get a particular value. So that value actually represents the spread of all possible means in that given sample. And this is what is called the confidence interval. What I have explained in the past three or four slides is the mathematical derivation of confidence interval. It is really not something which requires to be calculated, and there are freely available online calculators that help you calculate confidence intervals. So basically, to summarize, whenever you are doing a study which is comparing two samples, or which is comparing two entities, we calculate what is the difference that we get between the two measurements. Once you calculate the difference, you calculate the p-value for the difference. This can be easily calculated by online statistical calculators. As we studied in the first three, four slides, whenever the p-value is less than 0.05, these results can be considered as statistically significant, because it indicates that there is less than 5% probability that the difference that you got occurred by chance. That means there is more than 95% probability that the difference that you got in your study actually exists out there. Now, once the p-value has been calculated, one should also calculate the confidence interval around the means that you calculated in the difference. So whenever you have one value for one sample, when you calculate a confidence interval, it gives you a range around that value. 
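The calculation just described, standard error as the standard deviation divided by the root of the sample size, and the interval as roughly two standard errors on either side of the mean, can be sketched in a few lines of Python. This is only an illustrative sketch: the birth weights below are made-up numbers, and the function name `mean_with_interval` and the multiplier of 2 (an approximation to 1.96 for a 95% interval) are assumptions, not something taken from the slides.

```python
import math
import statistics


def mean_with_interval(values, multiplier=2.0):
    """Return (mean, standard error, (lower, upper)) for one sample.

    The interval is mean +/- multiplier * standard error; a multiplier
    of 2 approximates a 95% confidence interval.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation
    se = sd / math.sqrt(n)          # standard error of the mean
    return mean, se, (mean - multiplier * se, mean + multiplier * se)


# Hypothetical birth weights in grams for 30 infants (invented for illustration).
weights = [
    2100, 2350, 1980, 2420, 2260, 2190, 2310, 2050, 2480, 2230,
    2160, 2290, 2010, 2370, 2240, 2120, 2330, 2400, 2080, 2270,
    2200, 2150, 2390, 2030, 2300, 2250, 2180, 2440, 2090, 2360,
]

mean, se, (low, high) = mean_with_interval(weights)
print(f"mean = {mean:.1f} g, standard error = {se:.1f} g")
print(f"approximate 95% confidence interval: {low:.1f} g to {high:.1f} g")
```

With a larger sample, the standard error shrinks with the root of the sample size, so the interval around the mean gets narrower.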
For example, suppose you calculate the intraocular pressure drop by a particular medication in one group. And by a second medication in the second group. The intraocular pressure drop by the first medication in the first group is 6 millimeters. The confidence interval around it is a range, as I told you, so suppose that range is 4 to 8. What it means is: In your study, you got the IOP drop of 6 millimeters of mercury. Now, your study had a particular number of people. Suppose you extrapolate this study with the same study setting to the entire population at large. In the worst case scenario, your IOP drop that you may achieve can be as low as 4 millimeters, or in the best case scenario, your IOP drop may be as high as 8 millimeters. That means the 6 millimeter drop that you got in your study, in the real world, could have been as low as 4, and as high as 8. So this is what confidence interval does for you. It basically tries to overcome the error that you have induced in your measurement because you took a small sample and did not take the whole population. So a confidence interval gives you a range of the mean, which can be extrapolated to the population. So as against the means of the two groups, which will be point estimates, the confidence intervals will be a range. So the 4 to 8, which was the range for the value of 6 millimeters, in the other group — let us assume that the IOP drop was 10 millimeters. So that itself may have a range which is 8 to 12. So that means, if the second medication in your study ensured an IOP drop of 10 millimeters of mercury, the same thing, if you repeat with the same study settings in the entire population, your values could be as low as 8 or as high as 12 millimeters. So whenever one is trying to interpret whether a difference actually exists, one tool that we have is the p-value. So if the p-value is less than 0.05, we can say that statistically the difference exists.
But a small catch here is: That even though the p-value is significant, it is very important to ensure that the confidence intervals do not overlap. Because the confidence intervals are a range. If the two ranges of the two means in the two groups overlap each other, that means in the real world there is a possibility that irrespective of whether you use medication of the first group or of the second group, you may have scenarios where the overall outcome, in terms of the mean, may be the same. So we will try and understand these things further in a few examples. Before that, one last concept that I want to sort of put across, before we go to examples, is clinical significance. Whatever we discussed right now in terms of a significant p-value, and in terms of confidence intervals of the two means which do not overlap, is called as a statistical significance. Now, a statistical significance need not actually mean a clinical significance. A statistical significance is a mathematical value that you get out of calculations. Whereas a clinical significance is the intuition or the decision that the clinician makes in the clinic. So let me again give you a small example. Suppose we have two medications, which reduce intraocular pressure. The first standard routine medication that you’ve been using in your clinic left, right, and center is compared to a new molecule that comes in the market. Now, a study is conducted by the pharmaceutical company, and the results are presented to you. The first standard medication that you’ve been using left, right, and center — suppose that reduces the intraocular pressure by about 4 millimeters of mercury. The new medication that has come in the market is shown to reduce intraocular pressure by 5.5 millimeters of mercury. But the overall p-value for this study is 0.0005. 
Looking at the p-value by what we have studied ’til now indicates 0.0005 means a less than 0.05% chance — or a less than 0.05% probability that the difference has occurred just by chance. So this is a highly statistically significant result. Now, if you consider just the p-value, you will intuitively think that the new drug is better than the old drug, because the difference is so highly statistically significant. Before you interpret it that way, and take the new drug to the clinic, it is important to understand what is the overall quantum of clinical advantage that you got by the second medication. The first medication, as we studied, reduced the intraocular pressure by 4 millimeters. The second medication reduced it by 5.5 millimeters. So it is a mere increase of 1.5 millimeters of advantage. Now, all of the audience who has ever measured intraocular pressure on the standard Goldmann applanation tonometer would very well know that 4 and 5 or 4 and 5.5 is such a small difference that sometimes clinically you may mix up the two. That means, even if you use the new drug, there is hardly any difference in the overall quantum of advantage that you got over the older. So irrespective of what statistical outcome you got, this does not seem to appeal to your clinical sense. Hence you would say that the results, though statistically significant, are clinically insignificant. So this is a very important concept. That statistical significance is a mathematical calculation, whereas clinical significance is actually the clinician applying his or her own experience and intuition to derive whether the quantum of change is good enough to be applied to the clinic. So at this point, let us take the next poll question and see how we have understood whatever we are studying ’til now. So assume that two studies have been conducted. In the first study, the p-value for the difference in outcome is 0.056. Whereas in the second study, the p-value for the difference in outcome is 0.047. 
So what do you interpret? The first has a highly significant difference? The second study has a highly significant difference? The difference, though present, is not much? Or you would want more information to interpret the final results? All right. So most of you have got it correct. That we would require more information to interpret the final results, because the question does not mention confidence intervals. And we just stated that even though we have the p-value known to us, it is very important to interpret it in the light of confidence intervals. So we understood what is statistically significant and clinically significant value. We now will go through a couple of hypothetical examples, and then we will see how we apply whatever we have learned to real life journal articles. Now, we all have seen coins, and we know that they are non-weighted. When we say non-weighted, what we mean to say is that the coin is fair. It has a heads on one side. It has a tails on the other side. And whenever you flip it, a fair coin would give you a 50/50 chance of either getting a heads or getting a tails. So this is the intuitive probability that we all know. So whenever you are tossing a coin, let us assume that our outcome variable is the number of times that you get a head. Now, by intuitive probability, we know that the p-value for the difference between getting a heads or a tails is 0.5. That means 50%. Now, let us assume that you have a coin that I give you, and you are expected to toss the coin a hundred times, to check whether it is a non-weighted coin — that means it can give you an equivalent number of heads and tails — or it is a weighted coin. That means it will preferentially give you more tails or more heads. So let us assume that you toss the coin a hundred times. Though it is intuitive that the chance of getting either a head or a tail is 50/50, it is not necessary that you will have that happen in nature. 
So you have tossed it 100 times, and what you have got is 45 heads and 55 tails. Now, again, if you go to any online statistical calculator, and check for something called as p-value for a difference between 45% and 55%, for 100 events, I would say 100 events — because we tossed the coin 100 times — you would get a p-value which is 0.31. So if we stop our interpretation at this point, what we say is: The difference between the number of times that you get heads and the number of times that you get tails is not statistically significant, going just by the p-value, because the p-value is 0.3. When the p-value is 0.3, it means there is a 30% probability that whatever difference has occurred has occurred by chance. And we have just learned that the p-value that is acceptable is less than 0.05. Now, this is the interesting part. Now, let us repeat this a multiple number of times. This will help us understand the concept of p-value vis-a-vis confidence intervals. So the first column shows you the number of times that you got heads, in terms of percentage. So we have just seen that — suppose we got heads 45% of times. The second column shows you the number of events that you did. So in the first experiment, we tossed the coin 100 times. We kept on repeating, and did it 200 times, 300 times, 500 times, and 1,000 times. Now, for the sake of understanding, let us assume that every time heads was 45%. Now, if the heads was 45%, you have to now see what is the interplay between p-value, confidence intervals, and the number of events. So in the first row, we have heads occurring 45% of times. 100 events had occurred. So for this 45%, if you calculate confidence intervals, again, very easily calculated by online statistical calculators, you will get a confidence interval ranging between 35.25% to 54.75%. What it means is: The given coin, when tossed 100 times, gave you heads 45% of times. 
But suppose you keep tossing it again and again, again and again, your results are likely to lie between 35% heads to 55% heads. So I hope this concept is really percolating well. So for a 45% heads, there is going to be a 55% tails. Now, for the 55% tails, for 100 events, the confidence intervals, if you calculate for this, it’ll be 45 to 64. That means if you keep tossing this coin again and again, in a real life situation, the chances of you getting a tails could be as low as 45%, and as high as 64%. And the p-value that has just been calculated is 0.32. Now, just for a moment concentrate on the column 4, row 1, which is confidence intervals for heads. And column 5, row 1, which is confidence intervals for tails. If you see, the highest value or highest probability percentage for heads is greater than the lowest one for the tails. That means there is an overlap. So when you have a range, 35 to 54, and the second range, 45 to 64, it shows an overlap. When it shows an overlap, what it basically means is: There is a definite possibility that you could have a scenario where each may be 50%. Because 50 is a common value between these two ranges. So putting this information together, we can interpret that if this particular coin is tossed 100 times, and you get heads 45% of times, that means you get tails 55% of times, the confidence intervals for these percentages actually overlap. They mean that in the real life scenario, there could be an equality that you may gain in one of the experiments. Hence there actually is no intuitive difference, and we can safely say that this coin is non-weighted, or it’s a fair coin. Now, suppose one is not satisfied with the results, and one is sure that this coin is a weighted coin, because it is consistently giving you heads only 45% of times. But how do you prove it? Because you could not in this given experiment. The best way to prove it? Increase the number of events. 
So see what happens with the same coin, but instead of 100, you repeat it 200 times. So when you repeat it 200 times, and the heads are still 45%, that means the tails is still 55%. Just observe what happens to the confidence interval. What was 35 to 54 has now become 38 to 51. That means the range has reduced. What was 45 to 64 has become 48 to 61. That means the range has reduced here too. And if you look at the p-value for the difference, it’s come down from 0.3 to 0.16. We just learned in the initial slides that to call a result statistically significant, the p-value has to be less than 0.05. So it still stays at a statistically non-significant level, but it has definitely reduced. You can still see that 38 to 51 and 48 to 61 have 50 as an overlap. That means, in the real life scenario, there can still be a possibility that you may get 50% heads and 50% tails. Now, if you keep repeating the same thing more and more times, you will see that the difference between the upper and the lower values of the confidence intervals keeps reducing. In statistical terms, this is described as the confidence interval getting tighter and tighter, ’til you reach a point, which happens at 500 events, where the confidence intervals no longer overlap, and at the same time, you will see that the p-value has converted from non-significant 0.08 to significant 0.02. Same thing if you repeat still further, a thousand times. You will see that the p-value further drastically drops, and the confidence intervals keep getting tighter. This is a very important concept that one should understand. Whenever there is a study which does not have a provable statistical difference, the statistical difference can still be proven by bumping up the study with a large number of recruits. This is important, pertinent to the first results that I gave you, in terms of a glaucoma medication which you are using left, right, and center, and a new medication that came into the market.
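The coin table walked through above can be reproduced approximately with a short Python sketch. It assumes, since the slides do not say which method the online calculator used, a normal-approximation (Wald) 95% interval for each proportion and a two-sided z-test of the observed heads fraction against 50%. The function names are hypothetical. Under those assumptions the figures come out close to the ones quoted (for example 35.25% to 54.75% and a p-value of about 0.32 at 100 tosses).

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def wald_interval(fraction, n, z_crit=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    margin = z_crit * math.sqrt(fraction * (1 - fraction) / n)
    return fraction - margin, fraction + margin


def coin_summary(heads_fraction, n):
    """Intervals for heads and tails, overlap flag, and p-value vs a fair coin."""
    heads_ci = wald_interval(heads_fraction, n)
    tails_ci = wald_interval(1 - heads_fraction, n)
    # Two ranges overlap when the larger lower bound does not exceed
    # the smaller upper bound.
    overlap = max(heads_ci[0], tails_ci[0]) <= min(heads_ci[1], tails_ci[1])
    # z-test of the observed heads fraction against the fair-coin value 0.5.
    z = (heads_fraction - 0.5) / math.sqrt(0.25 / n)
    return {
        "n": n,
        "heads_ci": heads_ci,
        "tails_ci": tails_ci,
        "overlap": overlap,
        "p_value": two_sided_p_from_z(z),
    }


rows = [coin_summary(0.45, n) for n in (100, 200, 300, 500, 1000)]
for row in rows:
    h_lo, h_hi = row["heads_ci"]
    t_lo, t_hi = row["tails_ci"]
    print(
        f"n={row['n']:>4}  heads {h_lo:.2%} to {h_hi:.2%}  "
        f"tails {t_lo:.2%} to {t_hi:.2%}  "
        f"overlap={row['overlap']}  p={row['p_value']:.3f}"
    )
```

The observed split stays at 45% versus 55% in every row; only the number of tosses changes, and that alone is enough to move the p-value across the 0.05 line and separate the two intervals.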
So when the standard one is reducing it by 4 millimeters and the new one is reducing it by 5.5 millimeters, for a mere 1.5 millimeter drop, how would they show that the results are statistically significant? They would have shown it to be statistically different on the lines of this example that I have given you. They would have recruited 5,000 or 10,000 patients. So whenever the number of recruits is very high, even a small clinical difference can absolutely magnify in terms of the p-value and in terms of the confidence intervals, and look very, very impressively significant. So this is an important catch, which one should always keep in mind, when interpreting literature. Now, let us go to some journal articles quickly. So this is an article from Anesthesia, which compares ramosetron with ondansetron. So we would all be knowing that they are antiemetics. So the title of this journal article is: Comparison of ramosetron with ondansetron for the prevention of nausea and vomiting in high risk patients. So just going to the clip of the results in the abstract, the results say that the incidence of postoperative nausea and vomiting was found to be 35% in the ramosetron group as opposed to 43% in the ondansetron group, with a p-value of 0.19. So if you stop and interpret it, you understand that ramosetron had a lower incidence of nausea and vomiting than the other group, by about 7.7%. This is what is called a clinical difference. Statistically speaking, because the final p-value was not less than 0.05, you can take the results as statistically insignificant. Now, in a setting of the results being statistically insignificant, do you take this difference to the clinic? That means do you say that ramosetron is still better than ondansetron because it is reducing nausea and vomiting 7.7% more? This is based on the intuition of the clinician.
If the clinician in his or her wisdom feels that 7.7% better result in my patient outcome is something that is tangible for me, then you take a clinical interpretation that ramosetron is better. If not, then you leave it here, saying that although there is a difference, we will not take it to the clinic, because the clinical difference is not appealing, and there is no statistical difference. In the same article here, we see the first row, which I would like you to concentrate on. So this indicates the number of patients that had nausea in the first 6 hours postsurgery. The numbers are the actual numbers, and the brackets are the percentages. As mentioned, there were 103 patients included in each group. 35% of the patients in the ramosetron group had nausea in the first six hours, whereas 38.8% in the ondansetron group had nausea in the first 6 hours. Now, again, think in terms of clinical difference. 35% nausea and vomiting, and 38.8% nausea and vomiting is really not too much of a difference. Comparatively, the p-value is also not clinically — it’s also not statistically significant. So just stopping at this line, the interpretation is enough for us to tell ourselves that the difference between these two is not palpable enough for us to change our practice pattern from ramosetron to ondansetron or from ondansetron to ramosetron. Now, let us assume I am the person who makes ramosetron. Now in this study I showed that it causes 3.8% less nausea compared to ondansetron, but the clinicians elsewhere are using ondansetron. As a pharmaceutical company, if I want them to shift to ramosetron, I need to show a significant p-value. But stopping here, all I got is 0.5, which is not significant. So let us do the same exercise again.
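As a rough sketch of this exercise, the Python below holds the two nausea rates fixed and only scales up the number of patients per group. Several things here are assumptions rather than details from the article: the event counts of 36 and 40 out of 103 are inferred from the quoted 35% and 38.8%, the test is a pooled two-proportion z-test (the article's own method is not stated), and the function name is hypothetical. With 103 patients per group the p-value lands in the region of the 0.5 quoted above.

```python
import math


def two_proportion_p_value(events_a, events_b, n_per_group):
    """Two-sided p-value from a pooled two-proportion z-test (equal group sizes)."""
    rate_a = events_a / n_per_group
    rate_b = events_b / n_per_group
    pooled = (events_a + events_b) / (2 * n_per_group)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n_per_group))
    z = (rate_a - rate_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Assumed counts: 36/103 (about 35%) with ramosetron, 40/103 (about 38.8%)
# with ondansetron. Scaling both counts and the group size by the same
# factor keeps the rates, and so the 3.8% difference, unchanged.
results = []
for factor in (1, 2, 5, 10, 20):
    n = 103 * factor
    p = two_proportion_p_value(36 * factor, 40 * factor, n)
    results.append((n, p))
    print(f"patients per group = {n:>4}  p-value = {p:.3f}")
```

The difference between the groups never changes in this loop; the falling p-value comes only from the larger number of recruits, which is the point to keep in mind when reading the results that follow.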
The first row is showing the total number of patients, as shown in this study, and the incidence of nausea and vomiting in the ramosetron and the ondansetron group, and the confidence intervals that we calculated around each of these figures — the p-value as mentioned in the article itself is insignificant. Just like the coin experiment, if instead of 103, you recruit 206 patients, just see what happens to the confidence intervals. They start getting tighter. What happens to the p values? They start falling down. So the pharmaceutical company will classically go on and on and on, and probably recruit 2,000 patients. So what happens when you recruit so many patients? The confidence intervals stop overlapping. And the p-value falls into significance. So this is the point at which the pharmaceutical company would probably approach you with the new drug, and say that our results are significant. We tested them on over 2,000 patients. And look at our p-value. So our drug definitely causes less nausea than the standard drug. But you as a clinician need to look and read between the lines, and check that the difference actually is only 3.8%. Which is not palpably much, and hence there may not be a pressing need to change your practice pattern from ondansetron to ramosetron. Let us look at one more study, which was based on spinal anesthesia. So this was a randomized double blind controlled study where dexmedetomidine was used as an adjuvant agent. So we have the new drug, which is group D, and we have a group N, which is normal saline. Now, if you look at the first column, here you would see that the duration of sensory block — that means the duration of anesthesia — in the new drug group was 430 minutes. Whereas that in the standard normal saline control was 300 minutes. So if you just stop here, and not even look at the p values, you gain what is called as a clinically significant result.
In that your standard practice pattern is giving you a sensory block of 300 minutes, whereas the new drug is giving you a sensory block of 430 minutes. That is an advantage of 130 minutes, which is two hours. So if you are operating on a patient, and you get two hours of extra pain relief, you will definitely want to shift to that medication, unless something prohibits you. Now, this difference has been proven in the study to be statistically significant. Right down there on the slide, I have mentioned the number of patients that were present in each group in this study, which was 20 each. Now, the confidence intervals on 430 or 301 have not been given in the study, but they can be easily calculated. So if you see when we calculate the confidence intervals, you definitely find that the confidence intervals are not overlapping. This is hand in hand with the fact that the p values are statistically significant. So overall you find a big clinical difference. You find a very small p-value. And you find confidence intervals which don’t overlap. All of these things together tell us that this drug is definitely better than your standard practice pattern. But just see what happens to the figures if, instead of 20 patients, you have recruited only 10. So if you calculate the confidence intervals of the same results where n is equal to 10 in each group, as against 20, you see that the confidence intervals overlap, and p falls down to 0.1. So suppose this would have been your original experiment. What you would have interpreted is that I definitely am getting a clinically relevant difference, or a clinically significant difference, but I am not able to prove it statistically. So I would like to go ahead and recruit more patients, and see what happens. This is what will happen when you recruit more patients, that your p-value largely will drop. So this was another example, to show you the interplay between p-value, confidence intervals, and the number of recruits. 
One last study that we will go through, again, to understand this concept, is an article from Retina. So in this article, a drug called aflibercept was given in the eye for age-related macular degeneration, and we saw what was the improvement in visual acuity. I would like the audience to calculate the first line. So it shows visual acuity on logMAR scale was 0.57 log, plus or minus 0.36, the standard deviation. If you compare with follow-up, it was 0.47 plus or minus 0.32 standard deviation. So when the logMAR visual acuity reduces from 0.57 to 0.47, it means it is improved. The p-value is less than 0.005, which means it’s statistically significant. So the p-value is 0.004. Confidence intervals have not been given in the article. The authors have mentioned in the text that a significant improvement in visual acuity was observed at 6 months. So according to the authors, from 0.57 to 0.47, that is significant. That is what is mentioned in the article. But if you calculate the confidence intervals for this particular difference, you will find that the confidence intervals overlap. This means that although the p-value is significant, the confidence intervals do not support it, so the statistical significance of the result is open to doubt. This is a very important interpretation, which actually puts forth the confidence intervals and the importance of not interpreting a p-value in isolation. Going further, the value 0.57 log and 0.47 log actually mean visual acuity a little worse than 6/18 before treatment and a little better than 6/18 after treatment. That means on the visual scale, actually, the vision of these patients has just hovered around 6/18, from a little worse to a little better. This intuitively tells the clinician that even from a clinical point of view, there has not been a major improvement. This difference is what is called a lack of clinical significance. 
So whenever you see this, one can completely disregard what is the final p-value, and just interpret the study as something which is not of clinical significance. So, finally, I would like to conclude by saying whenever you are conducting a study, one has to go in a particular order, in which you compare the two groups to get an absolute difference, see what is the total number of recruits which have entered the study, calculate the p-value, calculate the confidence intervals around the difference that you get, and derive statistical significance, but set your own clinical significance, which is based on your intuition and your experience. So this is a photograph where both the ladies have the same interpretation about the world. But the way they approach it is completely different. So this is what statistics is. Where you see what you want, and you interpret what you feel. It’s a basic interplay of numbers, wherein you can tweak the results according to your fancies, but whenever you tweak, the clinical significance still stays the same, and it is up to the clinician to interpret that from between the lines. I will end with a last poll question, which will probably give us an idea of what we have understood. So here we have a cholesterol-lowering medication. There are three types. Drug A, drug B, and drug C. Drug A and B are cheaper medications, whereas drug C is an expensive medication. So the column shows the number of patients that entered in the study, when each drug was being tested. The next column shows the drop in the overall mean cholesterol, after the drug was given. That means after drug A was given, there was a drop of mean cholesterol by 40 milligrams. After drug B, in one study, it was 20 milligrams. In the other study, the drop was only 2 milligrams, and with drug C, the drop was 5 milligrams of cholesterol. The last column shows you the p values, and the second to last column shows you the 95% confidence intervals for the difference. 
So going by this, which do you think is the best drug which can be taken to the clinic? All right. So excellently well done. Over 60% have gotten the correct answer, that drug A is the better drug. The way we interpret this here is to look at the quantum of difference. That is the first thing one should be looking at. So the quantum of difference on the clinical significance of drug A is twice that of drug B, the best result that you got out of drug B, and it is eight times that of drug C. So even though drug C has a very impressive p-value, just look at the number of patients that they have recruited. So to show a small benefit of 5 milligrams cholesterol drop, they have recruited a truck load of patients and shown a very significant p-value. So this is sort of saying that the study is significant, when actually it really is not. Drug B, if you see, in both the studies, whether they recruited a smaller number of patients or a higher number of patients, really did not get any significant p-value. Whereas drug A, irrespective of whether they recruited a lower number of patients or a higher number of patients, consistently got a very good clinical significance, which could not be proven when the number of patients was small, but when the trial was repeated again with the higher number of patients could be proven very well. So I hope today’s talk has been beneficial, and you have understood the interplay between p-value, confidence intervals, statistical and clinical significance. And thank you for your kind attention, and I would be happy to take questions at the end.

DR DAVE: Right. So I got a question that says: What if the p-value and the confidence interval do not indicate the same outcome? Usually p-value and confidence intervals go absolutely hand in hand. That means if the p-value is going farther and farther away from significance, so will the confidence intervals overlap more and more. If they go in the opposite direction, that means the p-value is getting significant, but the confidence intervals are really not following suit, and they are still overlapping more and more, it usually indicates that you have applied a wrong test. So if I can just deviate from our basic lecture, one of the basic premises of research is that whenever you collect data, you have to have a distribution of the data. That means you have all the values that you enter in your Excel sheet. Suppose there are visual acuities or the ages of a given set of patients. We need to see whether they have something called a normal distribution, or they have a non-normal distribution. There are specific tests that need to be used when the distribution is normal, and specific others which need to be used when the distribution is non-normal. Suppose you mix them up, and you use the wrong test. You will still get a significant or a non-significant outcome, but there the p values and the confidence intervals may not go hand in hand. So whenever you have the p values and the confidence intervals going in opposite directions, always take one step back and see whether you applied a wrong test. So I hope I answered your question. I’ll probably go to the next question, which says: Will the p-value always decrease when the sample size is increased? 
Again, if the rest of the study parameters and the study conditions remain the same, that means you conduct a particular study, and whenever you conduct that study, you will have a set inclusion and exclusion criteria, and you will have a set methodology — as far as you keep your inclusion/exclusion criteria and your methodology the same, p values will decrease, if the sample size increases. In fact, the sample size is probably the single biggest factor which affects the p-value. And this is very, very important, whenever you want to interpret a given study which has a very highly significant p-value. Always look back on the sample size, and see if the sample size is very large. Using a very large sample to get a very good p-value is called in statistical terms as overpowering a study. I hope that answers your question. So another question that I have right now is: How does a clinician reconcile p-value and clinical significance? So again a very important question. So I would put it in a way in which I would go ahead and interpret an abstract. So suppose I am reading a journal article. The first thing that I’ll go through is the abstract. The two things that I will look at in the abstract is: Among the two groups that are being compared, what is the absolute quantum of difference that has occurred? This is the first thing that I will look at. Now let us take a hypothetical example of, say, reduction of blood pressure. So suppose you have this study, which is comparing blood pressure reduction by a standard medication — say Lasix — and a blood pressure reduction by a new medication. It says that the blood pressure reduction by the standard medication is 10 millimeters of mercury. And the blood pressure reduction by the new medication is 12 millimeters of mercury. Now, once I reach this sentence, I will try to interpret the clinical significance of it. 
My standard therapy, which I have been doing for years, is giving me an advantage or an improvement of 10 units. Now a new thing comes in, and mind you, anything which is new is usually costlier, and that cost actually translates to further health costs to the patient. So if this new thing reduces or gives me an advantage of just 2 points, 2 millimeters more, this is where I judge as a clinician: Would I change my practice pattern for a 2 millimeter drop? Probably all of us would agree that we would not. So at this point you need not even look at the p-value. Because if the p-value is insignificant, it actually goes in sync with the fact that the clinical significance is not much. And if the p-value is significant, chances are very high that this study is overpowered, and usually the study researchers would have included thousands of patients to make this 2 millimeter difference seem statistically very significant. As against this, if the new medication would reduce the blood pressure by, say, 15 or 20 millimeters of mercury, I would really say… Wow, this new drug is really working. Now I would look at the p-value and confidence intervals. If the p-value and the confidence intervals do not show statistical significance, it probably means that I have recruited or the researchers have recruited a lesser number of patients than they should have. This particular interpretation usually does not exist in literature, because the researchers by themselves would know that we have a big difference, in terms of clinical significance, but we are not getting a statistical significance. So even before the paper is put up for publication, they would probably recruit more patients and try to show that the clinical significance is also statistically significant. So the way you reconcile is completely based on your intuition, and what according to you is something which is clinically significant, what according to you is something which you would like to apply to the clinic. 
I hope that answers your question. So the next question, I guess, is: Based on the sample size and given p-value in the samples, did you keep a conventional rule that no outliers were there, and in case of a skew, how would we interpret the outcome vis-a-vis p-value? That’s an excellent question. I kept an example that basically is not having any skew. That was just for the fact that I did not want to confuse the audience, and with the sample size and with the absence of outliers, basically, it kept things very simple. Even if there is a skew, the way you interpret the outcomes vis-a-vis the p-value, the confidence intervals, and the number of recruits is exactly the same in which I told you. That means you look at the difference, look at the clinical difference, then look at the p-value if the clinical difference appeals to you. If the clinical difference does not appeal to you, then you may or may not even consider the p-value. So what is important at the end of the day is a clinical significance, and what the p-value and the confidence intervals do are they just justify whatever clinical significance you’ve got in your study. So going to the next question… Antonio asks whether in 10 patients a good p-value is more significant than the same p-value in a thousand patients, and is it stronger? So I would answer it in a way that, if you are getting… I would assume when you meant a good p-value, you meant a p-value which is statistically significant. So if you get a statistically significant p-value for a difference of measurements in just 10 recruits, that itself tells you that including 10 patients is adequate enough to answer your study question, and show a difference. Any more recruits that you do will just improve the overall p-value of the study, probably will improve the value of the study in a statistical manner. 
As far as the clinical outcome is concerned, usually it would not matter too much whether you take 10 patients or 50 patients or a thousand patients, because your methodology and your two groups remain the same. But as you include patients more and more, your p-value will probably get stronger and stronger. So the face value of your study will definitely get better. So I would conclude this answer in a way that if you get a significant p-value in just 10 recruits, you have sufficient grounds to stop the study and say I have proven my point. But in case you want to recruit more patients, you are definitely free to recruit more and more patients, which in a numerical sense will probably make your study better and better, and look more appealing. I hope that answers your question. So the next question that I have is: When should I calculate confidence intervals? Should it be only when I have two groups? So this is not that you calculate confidence intervals only when you have two groups. Even when you have a single measurement, at that time, one can calculate a confidence interval. So one of my colleagues has already taken a class in which he would have shown you a table, where it says what was the overall risk of complications in a given cataract surgery, based on the number of surgeries that were done. So even when you are interpreting a single entity, you can still calculate confidence intervals online. That means: Suppose I am doing two cataract surgeries, and I get one complication. My complication rate is 50%. But in the real life scenario, I need to calculate the confidence interval around 50%, for an n of 2. To give me a range which can be extrapolated to the entire population. This is going to be a wide range. It could be, for example, as low as 22% to as high as 76%. But if I keep doing my cataract surgeries again and again and my n increases, the confidence interval will get tighter and tighter. 
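The tightening of the interval around a single complication rate can be sketched as follows. This is a minimal illustration that assumes the Wilson score method, which is only one of several ways to put a confidence interval around a proportion. An online calculator may use a different method, so the limits it prints need not match the illustrative range quoted above. The function name `wilson_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(events, n, confidence=0.95):
    """Wilson score interval for a proportion (events out of n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = events / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Same 50% complication rate, observed over more and more surgeries.
for events, n in [(1, 2), (10, 20), (100, 200), (1000, 2000)]:
    low, high = wilson_interval(events, n)
    print(f"{events}/{n}: {low:.1%} to {high:.1%}")
```

The observed rate stays at 50% in every row, yet the range that can be extrapolated to the population narrows as n grows.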
That is because the amount of extrapolation that you do intuitively keeps getting lesser and lesser, because you are doing more and more surgeries. So a confidence interval can be calculated for each and every measurement, and does not require to necessarily have two or more groups to apply it. The next question that I have is: In most studies, we see mostly only one confidence interval for the outcome to determine if confidence intervals didn’t overlap, compared to the second outcome, how can we go about? So if you have only one confidence interval for a given outcome, it is a very unusual scenario that you have mentioned. Studies will either not give a confidence interval at all, or will usually give confidence intervals for both the groups. But suppose both or any one of the confidence intervals is missing for a particular group. You require to go online, and go to confidence interval calculating statistical calculators, and the information that you require to calculate a confidence interval is the number of recruits in the study, the overall mean of the given group, and the standard deviation for the given group measurement. So mean, standard deviation, and the number of recruits are the three pieces of information that you require to calculate a confidence interval. And these three will definitely have been given for any measurement in a given study. So another question is: What does a minus confidence interval indicate? A minus confidence interval usually indicates — a minus confidence interval, rather, is a different way in which confidence intervals are represented. What I discussed with you today are: Two ranges of confidence intervals. Suppose you have two groups and two means. You have confidence interval around one mean. You have confidence interval around a second mean. And then, as I taught you, you need to see whether those two ranges overlap or not. 
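As a rough sketch of what such a calculator does with those three pieces of information, the snippet below builds an approximate 95% interval around each mean and checks whether the two ranges overlap. It assumes a normal (z) approximation, whereas a t-based multiplier would be more appropriate for small groups, and the group values are hypothetical, chosen only for illustration. The names `mean_ci` and `intervals_overlap` are likewise made up.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci(mean, sd, n, confidence=0.95):
    """Approximate CI for a mean from summary statistics (z-based)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half = z * sd / sqrt(n)
    return mean - half, mean + half


def intervals_overlap(a, b):
    """True if two (low, high) ranges share any value."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical summary statistics for two groups.
group_a = mean_ci(mean=430, sd=100, n=20)
group_b = mean_ci(mean=300, sd=100, n=20)
print(group_a, group_b, intervals_overlap(group_a, group_b))
```

Re-running the same call with a smaller `n` widens both ranges, which is how two intervals that were clearly separated can start to overlap.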
Another way of interpreting confidence intervals is: Instead of calculating ranges of the two individual means, you take a difference between the two means. So that will be a single number. And then you calculate the confidence interval around that number. So you will get one range. Suppose in that range one value is in minus, and one value is in plus. What it indicates is that in the real life scenario, whatever difference you got between the two means could either be minus in some cases, or be plus in some cases. That means the results could be either one group being better or the other group being better. This is called as confidence interval straddling zero. I am repeating my point. Whenever you want to calculate confidence intervals, either you calculate around the two means individually, and then see whether the two ranges are overlapping, or a different way is you calculate the difference between the two means, which will be one single number, and then calculate the confidence interval around that difference. So you can have three possibilities. You can have the two values of that range, either both being in positive, or both being in negative, or one being in negative and one being in positive. In situations where both are positive, or both are negative, it indicates a significant confidence interval difference. In a situation where the difference shows values of one negative and one positive, it means that the difference of means could be — could go either way in a real life scenario. This is a tricky thing to understand, but I hope most of you are getting it. So if one is negative and one is positive, it means that the difference of the means is straddling zero. If it is straddling zero, that means, in a real life scenario, there can be a possible situation where the difference is actually zero. So that is the way you interpret confidence intervals, and that is the way you interpret or give meaning to a minus confidence interval. I’m sorry. 
I mixed up the questions. I’ll just go to the next question. Yeah. So there is one question, which says: How to interpret for a rare disease, where a small number of patients are included. An excellent question. So it is very important to understand that though for each and every piece of science, all of you, all of us, everybody would really want that it can be proven in a study format. In rare diseases, sometimes it is not possible. Because if the numbers are very small, you may not be able to show a really good statistical difference. So in rare diseases, more than showing a statistical proof in terms of a journal article, what is important is individual senior experiences. So in rare diseases, one may have a condition where you have 7 or 8 patients of a particular disease. Four of them you treat by a particular therapy. The other four you treat by a second therapy, and then you compare. The best you can probably do here is see what is the clinical significance. Is there a quantum of difference which is appealing to you, tangible to you, in either of the groups? You definitely are not likely to be able to prove it statistically. So the sort of… The Achilles heel for proving something statistically is, again, the n, which is something which really affects statistics. If you do not have adequate numbers, unfortunately, even if you have a clinical difference, you will not be able to prove it statistically. So in conclusion, for a rare disease, you cannot always fall back and prove a statistical difference. But you have to go by whatever clinical hunch that you have, or whatever clinical difference that you see, in case you conduct the study. I hope that answers the question. There is a question which says: Even though the clinical and statistical significance is not there, can we use confidence intervals and mean difference to explain the clinical difference? You absolutely can. 
Whenever you interpret a particular study, you should first zoom in on the quantum of difference that you see between the two groups, and not consider the p-value immediately. The moment the quantum of difference that you see appeals to you, it tells you that the thing is clinically significant. Once something is clinically significant, then the second step is to look at p values and confidence intervals in the study and see if they are indicating the same thing. So, to explain the clinical significance, one does not require a p-value. One does not require confidence intervals. Both of those are numerical things, which are used to prove statistical significance. Clinical significance is intuition, experience, decision of the treating clinician. I hope that answers the question. When do we say the p-value is significant? This is a harder concept. I will share my mail with you. It will take some time for me to explain to you what is a significant p-value. But I will definitely get back to you on mail. There is one question which says: Is it possible to have a confidence interval which is higher than the mean? And the example which has been put up is: 10.5 plus or minus 12. So I would just like to correct the person who has a question here. That what you have put up is not a confidence interval. What you have put up is mean and the standard deviation. What a standard deviation indicates is: What is the spread of the values in a given sample? That means: Suppose you have a sample of 100. Your mean is 10.5. And your standard deviation is say, for example, 4. What it means is: The values in your whole sample will probably vary from 10.5 minus 4 to 10.5 plus 4. It is that sort of a thing. Now, usually the standard deviation is smaller than the mean, in a normally distributed or a non-skewed dataset. 
Whenever you have a dataset which is skewed, or is not normally distributed, that is when you get a standard deviation which is almost equal to, or even larger than, the mean. So to answer your question, what you have put up is standard deviation, and yes, it usually is much, much smaller than the mean. But that would indicate that the distribution of your values is normal. But in conditions where the standard deviation is equal to or even larger than the mean, it indicates that your sample is skewed. Generally speaking, if two times the standard deviation is less than the mean, that means SD multiplied by 2 is less than the mean, then you usually have a normally distributed dataset in front of you. In conditions where you are not sure, again, statistical calculators are freely available online. You just have to put your Excel data sheet into the software, and it will tell you if your distribution is normal or it is non-normal. That will help you decide which test you apply to get the difference. That will ensure that you do not get an erroneous value of the p values and the confidence intervals. I hope that answers the question. So on that, any other questions that any of the audience would like to ask? Right. So in case there are no more questions, I would take your leave. I hope all of you enjoyed the lecture. I hope all of you gained something out of this. And I would be really happy if these things are taken back home and most of you try to pull out a few journal articles and just try to revise by applying whatever we learned today, and see if you can make sense of those journal articles. I’ll try and see if my mail ID can reach every one of you, and in case you have any statistical doubts, any time, please feel free to write to me, and I’ll try to get back to you within 24 hours. Thank you very much! You’ve been a really, really good audience. Thank you so much.


August 18, 2017

Last Updated: October 31, 2022
