Lecture: Statistics in Medicine Part IV: Linear Regression and ANOVA
During this live lecture the following are discussed:
Understanding linear regression
How does correlation coefficient help us interpret data?
F Test, analysis of variance (ANOVA) and comparison of group means
Linear regression and ANOVA are some of the most commonly used tools in interpreting medical research data. One needs to understand how to interpret data in our own experiments and also understand how other researchers might interpret data in their studies. We aim to simplify concepts on simple and multiple linear regression and understanding analysis of variance in medical literature.
Lecturer: Dr. Aravind Roy, MS
DR. ROY: Hello, everybody. Welcome to the series of lectures. We’re going to start the discussion today on linear regression. So, a disclosure before starting the talk: I am not a trained statistician, but I understand statistics in interpreting clinical research. If there are doubts during the discussion regarding concepts or anything else that you might be left with, please do ask the questions. Type the questions in. So, we’ll start the discussion and we’ll travel on through the talk.
As I start this lecture, will you tell me what is your position in which you work?
Okay. So that’s nice to know. Okay. So, let’s start with linear regression. Linear regression can be of two types. It’s either simple or multiple linear regression. Simple has one independent variable which affects one response or dependent variable. We will take up questions and examples later.
The other type is multiple linear regression where there are several predictor variables and one response variable.
So, we have scenarios where these two concepts of simple and multiple linear regression come in. When we talk about the relationship between variables, we should understand that there are two different types of relationships. One is deterministic, the other is statistical. For a deterministic relationship, as we see here, suppose the length of an object is measured in inches. There is a formula by which you can say exactly how much it would be in centimeters, meaning one inch is 2.54 centimeters. If we take the example of the graph on the right, we correlate height with weight. As the height increases, we expect that the weight would also increase. It may not be exactly linear, but it will increase. It is not an exact relationship. It is a statistical relationship. Why is this important? Because linear regression is not meant for deterministic relationships. It is used for statistical relationships.
Now linear regression makes some assumptions. If we go back to the previous example, there is an assumption that as the height increases, the weight will also keep on increasing. Second, if it is a hyperbolic relationship, then the linear equation will not work. We will use the coefficient of correlation, R, and R squared. In addition, we will also see that there are confidence intervals, which may overlap, for the mean response in each group.
Let us go to the height and weight example. As the height increases, we see in the data points that the weight also keeps on increasing. One can fit the line that best describes the relationship between these two variables. The line is given by the equation y = b0 + b1x, where b0 is the intercept and b1 is the slope of the line through the distribution of data points.
So, if we see the scatter plot, we can draw several straight lines, all of which pass through these data points. How do we know which is the best line? It is the straight line that has the least errors of prediction. Here, the black straight line is the line which has the least errors. That means the data points are closer to the black line, meaning that what we’re predicting is almost equal to what we’re observing when we predict the weight from the height of the individuals. This is what the equation means. I hope the concepts are clear. If not, ask questions.
So, the linear regression model makes an assumption that the relationship is a linear function. The errors are independent, meaning that the error at one observation is independent of the errors at the other observations, and these errors are normally distributed. If we go back to this example, the error at this point is not related to the error of prediction at this point over here, and vice versa. The errors should also have equal variance. When all of these criteria are met, we can calculate a linear regression and find what would be the best line for this set of data points.
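The idea of the line with the least errors of prediction can be sketched in a few lines of Python. This is only an illustration: the heights and weights are made-up numbers, not data from the lecture, and the function names `fit_line` and `sum_squared_errors` are hypothetical helpers. It fits a line of the form y = b0 + b1 * x by least squares and checks that a slightly tilted line has larger squared errors.

```python
def fit_line(xs, ys):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared differences between observed and predicted y."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only
heights_cm = [150, 160, 165, 170, 180]
weights_kg = [52, 58, 63, 66, 77]

b0, b1 = fit_line(heights_cm, weights_kg)
best = sum_squared_errors(heights_cm, weights_kg, b0, b1)
tilted = sum_squared_errors(heights_cm, weights_kg, b0, b1 + 0.05)

print(f"slope b1 = {b1:.2f}, intercept b0 = {b0:.2f}")
print(f"squared errors: fitted line = {best:.2f}, tilted line = {tilted:.2f}")
```

Any other straight line drawn through the same scatter gives a larger sum of squared errors than the fitted one, which is what makes the fitted line the best line.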
So, in addition, we also need to know that the slope is very important here. If the slope is exactly … there is no association between the two variables. But with a steeper slope, that means there is a linear and a strong association. There are two types of data that we’re examining over here. If we predict the weight from the height, we know that if the height increases, the weight is also going to increase, if not proportionately, then to some extent.
Do the cells in the AC match the color of the iris? Maybe, maybe not. We don’t care. As we see in our scatter plot here, the data points are thrown all over the chart. Almost everywhere the data points are scattered. So, when the software calculates the linear equation over here, you can see that the line is almost parallel to the X axis. What it means is that the association is not very strong. Okay?
So, this association is measured by a quantity called R squared, which is the coefficient of determination. We will learn more about it in the subsequent discussion. What it tells us is how strong the association is between two variables, that is, the predictor and the response variable.
So R squared, what does it mean to you as an ophthalmologist? For example, in the weight and height example, how much does the height affect the weight? R squared is a value between 0 and 1. Values close to 0 mean there is not much association between the variables. If it is close to 1, the association is much stronger. The correlation coefficient R can also have a plus or minus sign, which shows whether it is a positive or a negative correlation. However, we need a disclaimer that association is not causation. So, we need to interpret data points in the context of the literature that we are studying. If we go back to our previous example of height and weight, we know that if the height increases, the weight can also keep on increasing. But we also know that though statistical software can calculate an equation relating the AC to the color of the iris, it doesn’t really make sense. The data points are all over the place. The association is not too great. You get an R squared much closer to 0. Now, how do you interpret R squared?
If we see that the R squared is .6, .7, .8, you will need to multiply it by 100. If you multiply .6 or .7 by 100, you get 60 or 70, and that means that 60 or 70% of the variation in the response is explained by that predictor variable. You also need to understand that what counts as a strong association differs from study to study. So, in a medical study, 60% of the variation may be explained by the lens position and another 40% by other factors. What is good for a medical study will not be good for an engineering study, where you cannot afford, for example, a plane crashing because your predictor variables explain only part of the outcome. So, in an engineering study, you will need to be much more accurate. If R squared is .6, about 40% of the variation in how that engineering device, that mechanical part or, say, that technology performs is left unexplained. That may not be very good in the context of that industry or that research question. But in a medical research question, anything that is better than .5 usually is considered to be a strong association. So, you also need to understand that the R squared, the coefficient of determination, is interpreted not only by the strength of association, how strong it is or how close it is to 1, but also in the context of the research question. Okay?
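To make the interpretation of R squared concrete, here is a minimal sketch in Python. The heights and weights are hypothetical numbers chosen for illustration, and `r_squared` is a helper written for this example. It computes R squared as one minus the ratio of the unexplained variation (errors around the fitted line) to the total variation (deviations around the mean).

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Variation left over after using the line to predict y
    ss_error = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its own mean
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_error / ss_total


# Hypothetical data, for illustration only
heights_cm = [150, 160, 165, 170, 180]
weights_kg = [52, 58, 63, 66, 77]

r2 = r_squared(heights_cm, weights_kg)
print(f"R squared = {r2:.2f}")
print(f"About {r2 * 100:.0f}% of the variation in weight is explained by height")
```

With tightly clustered points like these, R squared comes out close to 1. With points scattered all over the chart, the same function would return a value close to 0.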
Now we go to ANOVA and the F test. ANOVA tests whether the means of three or more groups are different. When we have three or more groups compared simultaneously, are the means similar or dissimilar? So, for example, if drug A is used to decrease blood pressure, and then we have drug B and drug C, these are different groups, and the researcher is trying to find out which drug decreases the blood pressure the most, or whether there is no difference among the groups. So which statistic do we use? Here we can use ANOVA to calculate whether the means are the same or different. Why ANOVA? Because if we compare the groups pair by pair, drug A to drug B, drug A to drug C and so on in different combinations, then we are not taking the total population for the research question together, and each extra comparison adds to the chance of error. If there are only two groups, we can compare them directly. ANOVA is used when there are more than two groups, a minimum of three, and we are comparing the means of those groups. It uses something called the F statistic. What the F test and the F statistic are is what we are going to study in our subsequent discussion.
So, I have tried to minimize the mathematics here and simply show the concepts. There are statistical tests by which these are calculated. Probably in a later session you might need to understand the data collected, or rather what we are studying. The technical part will be clear a little later on. There are several nice sources which can help you calculate all of this and understand it from a more mathematical point of view. The aim of the current discussion is more to understand the principle behind this, how the F test and regression are calculated, and how to interpret them. So, the F test is basically the ratio of the variation between group means to the variation within samples. When we are studying three, four or more groups, the F statistic helps us understand what the variation between the group means is, all compared together, and what the variation within each sample is. So basically, if the F value is low, then the samples lie closer to each other and the means are more nearly equal. If it is large, then there is a lot of variation between the groups, but it doesn’t tell us which is better and which is worse. It only tells us that the means are not equal to each other.
So, in linear regression, as in ANOVA, we come across a concept called the sum of squares. The sum of squares is a measure of dispersion of data. Within groups, the data can be all over the place or it can be very homogenous and tightly packed. So, a measure of dispersion is what we are studying. It helps us calculate the standard deviation.
Suppose we look at this table. There are several data points for one variable. So, how is the sum of squares calculated? We sum the observed values, and we also sum the squares of the observed values. So, if we see how the calculation is done, we take the sum of X, which is the variable being measured, and then square it. We get 196, and then we divide it by the number of observations, which is 28. So, when we subtract that from the sum of the squared values, we get a value of 8. This is the sum of squares for this set of data.
So, let us take another example with some more data points. There is a mean for this, which is 5, and from each data point there is a deviation from the mean, which is 2, 0, 0, 0, and minus 2. Minus and plus is not important. What is important is how far each point deviates from the mean. If we square the deviations from the mean and add them, we get the sum of squares. So, this is the easy way of doing the sum of squares. Then why would we want to follow the other way? I’ll touch on that shortly.
So, if we do it the old way, where we calculate X squared for each point, it comes to the same. We add up all the X, we add up all the X squared, and from the sum of X squared we subtract the square of the sum of X divided by 5, and we get a sum of squares of 8. So, there are two ways of calculating. This one uses the sigma, the summation sign, and it is the statistical formula. Why should we use it? We use it when we have large data sets, or when the mean is not a round figure. Suppose instead of 5 the mean is 5.67. Then it is more convenient this way.
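The two ways of calculating the sum of squares can be checked side by side. In this small Python sketch the five values 7, 5, 5, 5, 3 are reconstructed from the mean of 5 and the deviations 2, 0, 0, 0 and minus 2 mentioned in the talk, and the variable names are chosen for illustration.

```python
# Values reconstructed from the example: mean 5, deviations 2, 0, 0, 0, -2
data = [7, 5, 5, 5, 3]
n = len(data)
mean = sum(data) / n

# Way 1: square each deviation from the mean and add them up
ss_from_deviations = sum((x - mean) ** 2 for x in data)

# Way 2: sum of X squared, minus (sum of X) squared divided by n
ss_from_sums = sum(x * x for x in data) - sum(data) ** 2 / n

print(ss_from_deviations, ss_from_sums)
```

Both routes give a sum of squares of 8. The second route never needs the mean itself, which is why it is handier when the mean is not a round figure.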
I hope the discussion has been clear so far. Then we talk about something called the variance. Variance is how much the scores deviate from the mean. As we saw, there is a table with the values, the mean, and the deviation of each value from the mean. When we square and add those deviations, we get the sum of squares, and the variance is calculated from it. The reason we are looking at the variance is that we want to see how much variation there is between the groups and how much variation there is within each group. That gives us the idea behind the F statistic, which is what ANOVA calculates from the sample groups. So, the variance is the sum of squares divided by N, where N is the number of observations. When the sample size is small, we use N minus 1 instead. We subtract one from the sample size, and we get a variance of 2. So, if we go back to our previous example, the sum of squares is 8, and it is a small sample, so we subtract 1 from 5, and 8 divided by 4 is 2. The standard deviation is the square root of that. Here it would be 1.4 in the example.
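The variance and standard deviation from this example can be reproduced with Python's standard `statistics` module. The data values 7, 5, 5, 5, 3 are reconstructed from the mean of 5 and the deviations given in the talk. The sketch also shows the difference between dividing by N and dividing by N minus 1.

```python
import statistics

# Values reconstructed from the example: mean 5, sum of squares 8
data = [7, 5, 5, 5, 3]

# Sample variance: sum of squares divided by N - 1 (8 / 4)
sample_variance = statistics.variance(data)

# Population variance: sum of squares divided by N (8 / 5)
population_variance = statistics.pvariance(data)

# Sample standard deviation: square root of the sample variance
sample_sd = statistics.stdev(data)

print(f"sample variance     = {sample_variance}")
print(f"population variance = {population_variance}")
print(f"sample SD           = {sample_sd:.2f}")
```

The sample variance is 2 and the sample standard deviation is about 1.4, matching the hand calculation. Dividing by N instead would give 1.6.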
Okay. So, the ANOVA approach is used if we wish to compare the means of three or more groups, and this is different, as I said earlier, from comparing each pair of groups separately. We need to construct a table. Suppose we have different groups: group 1, 2, 3, 4, etc. Each group would have a particular sample size. Each group would have a particular standard deviation. There is a null hypothesis, and it is that all the group means are equal. The alternate hypothesis is that the means are not all equal, meaning that different interventions may have different results. They are not the same. If we reject the null hypothesis, it means the means are not equal and it is a non-homogenous distribution. This is how we construct the hypotheses.
The F statistic, as we have discussed earlier, is the variation between group means divided by the variation within each sample. This is the formula by which the statistic is calculated. There are several parameters to it. We take into account the number of samples and groups. So, the F and the degrees of freedom can at times seem daunting and difficult to understand. For simplicity, I have just mentioned that F compares the variation between the means of the different groups with the variation within the groups. The equation that you see on your screen is the statistical way of calculating it. The outcome also depends on the degrees of freedom you use.
There is a table of probabilities from which you can select the F value for a particular level of significance. There are two degrees of freedom: a numerator degree of freedom and a denominator degree of freedom. They depend on the number of observations you are taking from each group and the number of groups there are. N is the number of observations within each group. Okay?
So, this is just to give you an overview of how the statistical software will calculate them, okay? So, we will be focusing more on understanding what the ANOVA will mean. You will use online calculators also. If the null hypothesis is true, what does it mean? It means that the variation between groups will not be more than the variation within groups, and the F value will be small. If you remember the cartoon that was shown here, it shows how closely the data are clustered. It means there is a lot of homogeneity. If the null hypothesis is false, then there is a lot of variation between groups. The F value is large. There is a lot of dispersion of data. It does not say which group is better than another. It just says that they do not match each other, okay?
Then we also choose an alpha and a rejection region for the statistic, based on how the data is taken. So, there is a rejection range for the F statistic in this example. This can be found online or through the tables that are there. Okay?
So, if we construct the ANOVA table, or look at how the software calculates it, it calculates the sums of squares: the sum of squares between groups and the sum of squares for the error, that is, what you have observed versus what you predicted. It calculates the degrees of freedom, and then it calculates something called the mean squares. When we divide the mean square between groups by the mean square for the error, we get the F statistic. We then find the critical F through the tables online or any other reference you use, at a set level of significance. And then we see whether the F value is large or small. If the F value is larger than the critical value, it is significant. If the value is small, you cannot reject the null hypothesis. Okay?
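The steps of the ANOVA table can be followed in a short Python sketch. Everything here is illustrative: the blood pressure reductions for the three drugs are invented numbers, and `one_way_anova` is a hypothetical helper, not the routine of any particular software. It computes the sums of squares, the degrees of freedom, the mean squares and the F statistic for a one factor ANOVA.

```python
def one_way_anova(groups):
    """Return the pieces of a one-factor ANOVA table as a dictionary."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of each observation around its own group mean (the error)
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = k - 1          # numerator degrees of freedom
    df_within = n_total - k     # denominator degrees of freedom
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


# Hypothetical reductions in blood pressure (mmHg), for illustration only
drug_a = [4, 5, 6]
drug_b = [5, 6, 7]
drug_c = [9, 10, 11]

table = one_way_anova([drug_a, drug_b, drug_c])
for name, value in table.items():
    print(f"{name:11s} = {value}")
```

For these invented numbers the F statistic is 21 with 2 and 6 degrees of freedom. A standard F table gives a critical value of about 5.14 at alpha 0.05 for those degrees of freedom, so here the null hypothesis of equal means would be rejected. As in the talk, that says only that the means differ, not which drug is better.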
So, with this background, we will go over a few examples. So, for example, suppose there are different athletes who are doing different types of exercise such as jogging, aerobics or weight lifting. Suppose we measure the heart rate of each individual during each exercise. So, can we say which exercise increases the heart rate the most? This is the study question. So, the null hypothesis is that none of the exercises increase heart rate … they are all the same. When we run it through statistical software that tests for significance with ANOVA, we can get summary data. And if you calculate the sums of squares and standard deviation, it will display something like this. There are online calculators which can do all of the calculations, and you don’t need to calculate it yourself. You just need to fill in the data. You can use any number of groups, and we calculate the statistic from there.
It finds out for you the sum of squares between treatments, the degrees of freedom, and the mean squares. When we divide the mean squares, we get the F value, tested at a significance level of 0.05. We are getting an F value of 0.58. So quite understandably, the p value is larger than 0.05, and the result has no significance, which means that the group means are very similar in all the three groups.
So, when we study, there are things called one factor and two factor ANOVA. What is one factor ANOVA? One factor ANOVA means that we study the effect of a treatment or intervention across different groups. For example, whether exercise causes a change in heart rate such as the example we were studying here. Okay?
So, here it means that there is only one factor that is studied: Exercise. And what is the change in the heart rate. In two factor ANOVA, we study an additional parameter. Like different heart rate in different genders: Male versus females. That type of a model will be called two factor ANOVA.
So, when we go back to our examples on linear regression, suppose a researcher used a new IOL formula and recorded the final visual acuity for the following axial lengths. And then he used a linear regression plot and found the scatter shown in the next slide.
So, this is the scatter that the researcher found. Now what can we conclude from this experiment? What does it mean? Does it mean that as the axial length increases the IOL … the final visual acuity will also change? What it also means is, can we predict the final visual acuity from the axial length? And is that relationship linear or not? That is our study question.
So, the other questions which crop up when we look at such a clinical example are: is the formula showing linearity, that is, is there a linear distribution of data points? If it is linear, where is it more accurate? And what does the correlation coefficient tell us? If we look at this example, it tells us that most of the study subjects are clustered close to 6/6 or 20/20 vision for axial lengths somewhere between 19 and 24. But if you consider the axial lengths from 26 to 28, we see that the final visual acuity in logMAR is much higher. It is like 6/60, and the spread is much wider. So, there is a linear relationship, meaning that as the axial length increases with this new formula, the final visual acuity changes along the line, but the points stay close to the line only in this range. There is some linearity. The third thing we look at is that the data points are not all over the place. They are closely hugging the predicted line. For the data points which are within 24 millimeters, the predicted final visual acuity is close to 6/6. So with a correlation close to .7 or .9, as we got, the association was very strong for predicted final visual acuity for the shorter axial lengths. That means that this formula is more accurate, and the strength of correlation is greater, for normal axial lengths as compared to the higher axial lengths. We can also conclude that when the axial lengths are larger, this formula is not very accurate.
Now that we have understood linear regression and ANOVA, where do we use which test? So, suppose we are predicting a continuous variable such as blood pressure from another continuous variable such as axial length. Then we use linear regression.
Suppose we have a categorical variable, solid or foldable, or the type of surgery, together with a continuous variable, such as a logMAR value. Then we use ANOVA. And if you are predicting a categorical variable from a continuous one, then we use logistic regression. You can take into account this cartoon, which tells us that if there are two continuous variables, we should use regression. If we have a categorical predictor and a continuous outcome, use ANOVA. If it is the other way around, use logistic regression. What is the difference between regression and logistic regression?
Okay? So, I end this lecture with this, and thank you for your patient hearing. I will look forward to questions from you before we conclude this session.
>> Thanks, Dr. Roy. It doesn’t look like there are any questions yet, but maybe we will wait for five minutes.
>> DR. ROY: Yes.
>> If anyone has questions, you can type them in the Q&A box.
>> DR. ROY: One of the viewers asked if we can do this in Excel. You can do it in Excel. There are statistical calculators in there. There are also online calculators that you can use.
The other question is, what is repeated measures ANOVA? I’m not sure I understood the question. Can you be more specific, please? If I understand repeated measures, it means we take the same subjects and measure them more than once. Suppose we go back to the different types of exercise, and then we see the heart rate. So, we measure it again and again in the same five or ten subjects, instead of taking so many different subjects. If there is no difference between the means, that means they are similar. I hope that answers your question.
The software that you use for this: actually, there are several software tools available online. You can Google “ANOVA online calculator”, and it will give you a range of tools that you can use for your test results. You can fill in the data online. Once you enter them, the output comes out as an ANOVA table. It will have something like the sum of squares, the degrees of freedom and the F statistic. The aim of the talk was to make you understand what it means, what the sum of squares means, what the F statistic means, so on and so forth. But you have the independence to use any software that you want online. The point is to understand in which context you will use it, how you will interpret the results, and what it means for you as a clinical researcher.
The other question is, do we use post hoc for ANOVA. I’m not sure I understood the question. Could you be a little more specific please?
The difference between R and R squared: R is the correlation coefficient and R squared is the coefficient of determination. They show the strength of association. That is not causation, but how strongly the predictor variable is linked to the response variable. So, if we go back to the height versus weight example, the R squared tells us how much of the weight you can attribute to the height. And to get R from R squared, you take the square root of the R squared value. The R squared is always between 0 and 1. The R can be anywhere between minus one and plus one, and its plus or minus sign denotes a positive or negative correlation. Values near minus one or plus one indicate a strong negative or positive correlation, and 0 indicates that there is no correlation.
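One practical point about moving between R and R squared is that squaring removes the sign, so a plain square root only gives back the size of R. The sketch below uses invented data in which y falls as x rises, and `pearson_r` is a helper written for this illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data with a negative relationship, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 7, 4, 1]

r = pearson_r(xs, ys)
r2 = r * r

print(f"R = {r:.3f}")                        # negative: y decreases as x increases
print(f"R squared = {r2:.3f}")               # always between 0 and 1
print(f"sqrt(R squared) = {math.sqrt(r2):.3f}")  # size of R only, sign is lost
```

The direction of the association therefore has to be read from R itself, or from the sign of the regression slope, not from R squared.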
Post hoc tests to confirm the difference of means. So yeah, that’s the ANOVA table that you get. When you get the ANOVA table, you can compare the means, and the F test helps you either reject or fail to reject the null hypothesis. I hope that answers your question.
>> So maybe we’ll wait a few more minutes to see if there are any final questions.
>> DR. ROY: Yes, sure. Thank you.
>> That seems like all the questions. Do you want to end it here?
>> DR. ROY: If we don’t have any further questions, we can end the session. Thank you for the excellent support.
>> Thank you so much for presenting.
>> DR. ROY: Thank you. Bye bye.