# CNU BST 322 Regression and Correlation Coefficient

Collaborate Summary: four points for a two-page summary of the Collaborate lecture. Bullets and outline format are fine. Students can also annotate the written lecture document with thoughtful notes as another way to get credit.

Attachment: week_four_collaborate_slides_revised_june_2020.pptx

## BST 322 Week Four Slides

Revised June 22, 2020. Brooks Ensign, MBA, M.Acc.

## Deadlines

Week Four (end of course): Final Exam in MyStatLab, Independent Project, Week 4 homework, Discussion Questions, MyStatLab. Ask me for help!

## This Week: Week Four

Our agenda this week (predictions?):

- Scatterplot → correlation calculation
- Correlation calculation → derive the regression equation
- Regression equation → use it to predict (maybe, if significant)
- Consider confounding variables and multivariate regression
- ANCOVA: a light introduction

Also this week:

- Review correlation from Week One (Ch. 4)
- Algebra: draw a line through two points and get the slope and intercept; that gives you the equation (a simplified regression process)
- Regression: simple bivariate (two variables)
- Multivariate: more than one independent variable (x1, x2, x3)
- Ch. 9: simple bivariate; Ch. 10: multivariate; Ch. 11 (just the first 6 pages): introduction to ANCOVA

## This Week: Regression

Chapters 9 and 10 (and a tiny introductory bit of 11). With interval or ratio data, we move from scatterplots, to correlations, to regressions: deriving an equation to describe the data and (maybe) using the equation to predict values.

- y′ = a + bx
- y′ (y prime, also written ŷ) is the predicted value of y
- b = slope, a = intercept

## Regression

Regress: step back and analyze. In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.
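As a rough illustration of the line equation (predicted y equals a plus b times x), the sketch below fits a least-squares line to a small data set and uses it to predict. The course itself does this step in StatCrunch; the data values and the function names `fit_line` and `predict` are hypothetical and are only there to show the arithmetic.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares fit of y' = a + b*x; returns (a, b, r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                # slope
    a = mean_y - b * mean_x      # intercept
    r = sxy / sqrt(sxx * syy)    # Pearson's correlation coefficient
    return a, b, r

def predict(a, b, x_new):
    """Predicted value y' for a given x."""
    return a + b * x_new

# Hypothetical example data, not taken from the course materials.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, r = fit_line(xs, ys)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}, r = {r:.3f}")
print(f"predicted y at x = 6: {predict(a, b, 6.0):.3f}")
```

Note that the slope `b` and the correlation `r` come from the same sums but answer different questions: `b` says how steep the line is, and `r` says how tightly the points follow it.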
## Maybe predict?

Regression equations can be used to predict values, but only if the result is statistically significant.

Explanation: MyStatLab has a strict rule: if the correlation is statistically significant (p value less than 0.05), then, and only then, you can use the regression equation to predict. Otherwise, you simply use the mean (average) value of the dependent variable.

In class, DQ1 is a gray area: the p value is 0.06, but we will use the equation anyway (borderline).

## Preview: Correlation → Regression → Prediction

This week we focus on Chapter 9 (all of it), the first half of Chapter 10 (light treatment), and the first third of Chapter 11 (very light treatment).

This week, our test statistics are:

- r (lowercase): for simple correlation and regression with two variables
- R (uppercase): for multivariate regression, with more than one independent variable; we give multivariate regression a light treatment

## Significance?

To declare that our results are significant (i.e., probably not random), we need to reject the null hypothesis, and for that we need a large test statistic and a very small p value.

- Test statistics: t, F, chi-square (χ²), and now r and R
- For significance of r, see page 199 and page 418
- For significance of R, see page 231 (stay tuned)

## They Work Together

Think of the test statistic and the p value as the opposite ends of a seesaw: they move in opposite directions. For statistical significance, we want a large r or R test statistic (larger than the table value) and a small p value (smaller than alpha, 0.05).

- r or R test statistic greater than the table value
- p value less than alpha (0.05), the level of significance

## Review

If the absolute value of the test statistic is greater than the tabled value (at a given level of significance, α), or if the p value is less than 0.05, then the null hypothesis is rejected and the result is significant.

## Correlation and Regression (Review)

- We did correlation when we covered scatterplots in Ch. 4 (Pearson's r)
- This value r is calculated from a sample of data
- The population value of the correlation coefficient is rho (ρ), the lowercase Greek r
- We study r in a sample as an estimate of the correlation ρ in the population
- Remember that r is easily calculated in StatCrunch

## Scatterplot → Correlation → Regression Equation → Prediction

Regression: derive an equation from the correlation (if the correlation is statistically significant and strong enough to be predictive).

- y′ = a + bx, with a = intercept and b = slope
- y′ (y prime, also written ŷ) is the predicted value of y

## Correlation in StatCrunch

Click: Stat > Summary Stats > Correlation

## Slope, Correlation and R²

Contrast these (they are different):

1. Slope: rise over run; b in y′ = a + bx. Slope asks: how steep?
2. r = correlation (from Week One): does a change in y relate to a change in x? A strong correlation can have a low slope!
3. R² = the regression answer: the proportion of variance explained, also known as the coefficient of determination. In simple regression, R² = r × r.

## Correlation is not slope (rise over run)

Tight fit, or messy? Correlation is the degree of fit to a line: is it tight (very close to being a line, e.g., a correlation of 0.7 to 0.9), or is it messy? Weak or no correlation is a messy cloud (zero or low correlation, e.g., 0.1).

[Figure: perfect correlation (r = 1.0) with a slope of 0.1, illustrating a strong correlation with a low slope.]

## Correlation as a test statistic

Now we can look at Pearson's r as a test statistic. Are the values we see significant?

The null hypothesis here is that the correlation value is ... H0: r = ?

- A. Zero: there is no relationship
- B. Not zero: there is a relationship
- C. 0.5: there is a weak relationship

Vote now!

## Correlation as a test statistic: discussion

H0: ρ = 0

H1: ρ ≠ 0

The null hypothesis here is that there is no relationship (no correlation; r very small, close to zero). Any ideas from students? What is the null hypothesis? What is the alternative hypothesis? Discuss.

The null hypothesis is that there is no relationship between the variables in the population (ρ = 0). See page 199.

So we compare the test statistic r (which we use as an estimate of ρ) to the critical value in the table (p. 418). Again, if the absolute value of the test statistic is greater than the critical value, then the null hypothesis is rejected and the result is significant. Alternatively, we let the computer calculate the exact p value and just compare that to 0.05.

## Easy Way

The easy way to determine the statistical significance of the regression is the p value of the slope (not the p value of the intercept). Is the p value of the slope less than 0.05?

## Correlation-Regression Example: Better Charts

[Figure: a "bad" and a "good" version of the same chart, "Weight Gain After Overeating": fat gain (kilograms) plotted against nonexercise activity (calories), with fitted line y = -0.0033x + 3.3413 and R² = 0.6211.]

See Polit p. 35 for more tips on graphs.

## Significance of r

This was optional in Week One; it is now required (easy with StatCrunch: look at the p value).

Four slides from Week One (see p. 199, top): follow these instructions to test the significance of your correlation in the Independent Project, #5 and #6. This is required for Question Six in the Independent Project: test the significance of your correlation coefficient (page 199).

## Meaning of r value (page 71)

Pearson's r ranges from 0 to 1 for positive correlation and from 0 to -1 for negative correlation.
These are rough descriptions of strength:

- Positive correlation: r between 0 and 0.2 is weak, 0.2 to 0.5 is moderate, 0.5 to 0.7 is stronger, and above 0.7 is very strong
- Negative correlation: strong if r is below -0.5, weak if r is between -0.2 and 0
- A value of zero or near zero: no correlation

## Is r value significant? The easiest way

Look at the p value of the slope in your StatCrunch results (bottom right corner). Is the p value of the slope less than 0.05?

Parameter estimates:

| Parameter | Estimate | Std. Err. | Alternative | DF | T-Stat | P-Value |
|---|---|---|---|---|---|---|
| Intercept | 665.7143 | 131.6546 | ≠ 0 | 5 | 5.0565214 | 0.0039 |
| Slope | -0.6989286 | 0.29438862 | ≠ 0 | 5 | -2.3741696 | 0.0636 |

## Is r value significant? Using the table

- Week 2: "significant" in statistics means "not random" (rather than "important")
- Test of significance for Pearson's r: top of page 199 and page 418
- Calculate the degrees of freedom: d.f. = N - 2, where N is the number of data points
- At the very bottom of the table: a low r value can be significant with a large data sample
- At the very top of the table: small samples require large test statistics

Discussion Question One: a large r value may not be significant with a small sample size (top rows of the table). Contrast this with the bottom of the table: a small r value may be significant with a large data set.

Steps:

1. Refer to page 199 (top).
2. Refer to page 418 and use the shaded column (0.05).
3. Find the row that corresponds to the degrees of freedom, e.g., 10 - 2 = 8 d.f.
4. If the absolute value of your calculated r is greater than the table value, then the calculated r value is significant (non-random).

Example: Question 14 in the Week One homework

- Ten data points: d.f. = N - 2 = 8
- Page 418: shaded column (α = 0.05)
- Page 418: Table A.6, row d.f. = 8
- Table value: 0.632
- The r value is significant if it is greater than 0.632
- Is the r value significant? Yes: 0.91 is greater than 0.632

Test the significance of the r value in the Independent Project.

## StatCrunch Discussion Question One

Week 4 DQ One: the r value of 0.73 seems large (and significant?), but it is not quite significant, because the data set is very small. Remember: we predict, maybe?

This is the only close call in our course: 0.728 is less than the table value of 0.754. (How did I find this table value on page 418 using the guidance from page 199?)

## Regression

- Regression: use the equation derived from the correlation and scatterplot
- StatCrunch does all of this for us
- Regression: use the equation to predict

## Regression in StatCrunch

Click: Stat > Regression > Simple Linear, then fill in the template.

## Prediction in StatCrunch (using Regression)

StatCrunch Discussion Question One, simple linear regression results:

- Equation form: y = intercept - (b × x)
- Dependent variable: Cholesterol
- Independent variable: Caffeine
- Rounded: cholesterol = 665 - 0.69 × caffeine
- Cholesterol = 665.7143 - 0.6989286 × Caffeine
- Sample size: 7
- R (correlation coefficient) = -0.728
- R-sq = 0.52992857
- Estimate of error standard deviation: 155.77582

## R and R-squared

The parameter estimates are the same as in the table above. Significance? Look for the p value of the slope: 0.06 is greater than 0.05, so the result is not significant (but very close).

## Answers

How do you answer the questions in the discussion questions and the homework? See the next few slides!

## Discussion Question One (use this for Homework Q-7 also)

- Q (r): What is the correlation coefficient r and what does it mean in this case?
- A: The correlation coefficient r = -.728, which means there is a strong, negative correlation.
- Q (r²): What is the coefficient of determination and what does it mean in this case?
- A: The coefficient of determination is r². In this case it is equal to .53.
This means that 53% of the variation in cholesterol is explained by the independent variable.

- Q: Is there a statistically significant correlation between caffeine intake and cholesterol levels in this case?
- A: The table value is .754 and the absolute value of r is .73. Because the calculated value does not exceed the table value, the correlation is not statistically significant (very close, but not quite).

## Discussion Question One: strong but not significant

- The correlation seems strong, but it is not quite significant
- How many more data points do you need? (One or two)
- Note: this is a strong correlation that lacks significance (very small sample)
- We can also have a weak correlation with significance (in a large sample): look at the small values at the bottom of page 418

## Discussion Question One: using regressions to predict

Difference in methodology:

- MyStatLab teaches us that we use a regression to predict only if the regression is statistically significant. Otherwise we just use the average value.
- But this Discussion Question asks you to predict using this equation, which is not quite significant.
- Sometimes statistical approaches differ; this is the only borderline example in this class, but there are many in real life.

## StatCrunch Discussion Question One: a caution

The numbers here are slightly different from your discussion question. Use the numbers in the DQ; don't just copy these numbers.

## Discussion Question One: Predictions

- Q: What is the intercept? (Or: what would your cholesterol level be while ingesting no caffeine?)
- A: The intercept is 665.714. That would be the cholesterol level while ingesting 0 mg of caffeine.
- Q: What is the slope? (Or: what is what we call b in the linear regression equation?)
- A: The slope (b in the linear regression equation) is -0.636.

Simple linear regression results (StatCrunch):

- Dependent variable: Cholesterol
- Independent variable: Caffeine
- Cholesterol = 665.7143 - 0.6989286 × Caffeine
- Sample size: 7
- R (correlation coefficient) = -0.728
- R-sq = 0.52992857
- Estimate of error standard deviation: 155.77582

Parameter estimates:

| Parameter | Estimate | Std. Err. | Alternative | DF | T-Stat | P-Value |
|---|---|---|---|---|---|---|
| Intercept | 665.7143 | 131.6546 | ≠ 0 | 5 | 5.0565214 | 0.0039 |
| Slope | -0.6989286 | 0.29438862 | ≠ 0 | 5 | -2.3741696 | 0.0636 |

Use the p value of the slope; it is the same as the p value for the correlation. The p value of the slope is 0.06, which is greater than 0.05, so the results are not quite statistically significant.

## Discussion Question One: use a regression to predict

c) How many cups of coffee must you drink to lower your total cholesterol to 150 mg/dL (given that 1 cup of coffee equals 100 mg of caffeine)?

Algebra: set the predicted cholesterol to 150 and solve for x (caffeine in mg):

- 150 = 665.714 - 0.636x
- x = (150 - 665.714) / (-0.636) ≈ 810 mg
- 810 mg ÷ 100 mg per cup ≈ 8 cups

Better way: use the StatCrunch prediction tool for the DQ and for Homework Q7.

## StatCrunch: Prediction

Scroll down in the Simple Linear Regression screen until you see the prediction fields. Enter the assumed value of X, and StatCrunch will calculate the predicted Y value based on the regression equation.

## EC Discussion Question Three

Optional, but interesting (fun and easy).

## The Most Important Question in This Course

No math! This is your chance to use what you have learned in this course to recognize the mistakes and misconceptions in the medical literature; some statistical studies are poorly designed or executed.

## Misadventures

Skim the Vox article (design of medical research studies): http://www.vox.com/2015/1/5/7482871/types-of-study-design

## Misadventures?
Look at the Misadventures on this site:

- http://www.improvingmedicalstatistics.com/index.html
- http://www.improvingmedicalstatistics.com/entry_media.htm
- http://www.improvingmedicalstatistics.com/entry_high_school.htm
- http://www.improvingmedicalstatistics.com/Biased%20protocol.htm

Choose one of the examples cited. Write a short paragraph: identify the article and identify the abuse or misuse of statistical analysis.

Note: these research articles are prominent, recent medical articles (with mistakes!).

## Regression & Multiple Regression

Regression (bivariate, Ch. 9): one x variable

- Used to make predictions about the values of variables once we know their relationship
- The easiest case is linear: use the equation of a line to predict y values from one x

Multiple regression (multivariate, Ch. 10):

- An extension of simple linear regression where we use two or more x variables (factors) to predict the value of the dependent variable

## You Learned in This Class

- The word "factor" is used in this class instead of "cause". We recognize that explanations usually involve multiple factors.
- What is the cause of my hypertension? Trick question, because there is not one cause of hypertension (or of many other medical conditions).
- There are multiple contributing factors: salt, stress, genetics, diet, exercise, caffeine, decongestants, medicine.
- Multiple factors: these questions are addressed with multivariate regression in Ch. 10 (and MF ANOVA in Ch. 7).

## Capital R: Multiple Regression

We just want the basics in Ch. 10.

- Factors are x variables: x1, x2, x3, x4, etc.
- Adding another factor (x variable) will not make the regression worse, but the added benefit drops as you add x variables (overlapping variance)
- Goal: increase R² (the percentage of variance explained)
- Not every x variable has to be significant
- Generally, we may want about 2 to 4 variables
- The b values (weights) b1, b2, and b3 are valid only for the current combination of variables; they change when you change the combination
- Using too many factors may be overfitting

## Number of Factors and R-squared

If you can get the same R-squared value with fewer factors, that is a better choice (more robust, more scalable, less overfitting).

## Is your R value statistically significant?

How to decide? Look at your F value (and page 231).

- Homework, the F statistic: pages 230-231
- The F statistic is used for the test of significance with multiple regressions
- Is the calculated F greater than the table F (page 413)? If so, your R value is statistically significant.

## Multiple Regression F Statistic

See page 231 for the homework assignment. From page 231: df (between) = k, and df (within) = N - k - 1, where k is the number of x variables and N is the sample size.

## F statistic in Multiple Regression

Using the following information for R², k, and N, calculate the value of the F statistic for testing the overall regression equation and determine whether F is statistically significant at the 0.05 level.

Example: R² = 0.53, k = 5, N = 120 (see the table on page 413).

$$
F = \frac{R^2 / k}{(1 - R^2) / (N - k - 1)} = \frac{0.53 / 5}{(1 - 0.53) / (120 - 5 - 1)} = 25.71
$$

F = 25.71 is greater than the tabled F of 2.29, so the result is significant: reject the null hypothesis.
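The F calculation in the example can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `regression_f` is hypothetical, and the tabled F of 2.29 is simply the value quoted in the slides, not something the code looks up.

```python
def regression_f(r_squared, k, n):
    """Overall F for a multiple regression.

    F = (R^2 / k) / ((1 - R^2) / (N - k - 1)),
    with df (between) = k and df (within) = N - k - 1.
    """
    df_between = k
    df_within = n - k - 1
    f_value = (r_squared / df_between) / ((1.0 - r_squared) / df_within)
    return f_value, df_between, df_within

# Values from the slide example: R^2 = 0.53, k = 5, N = 120.
f_value, df_b, df_w = regression_f(0.53, 5, 120)

# Tabled F quoted in the slides (page 413, 0.05 level); hard-coded here.
TABLED_F = 2.29

print(f"F({df_b}, {df_w}) = {f_value:.2f}")
print("significant" if f_value > TABLED_F else "not significant")
```

For your own homework, swap in the R², k, and N from the assignment and read the tabled F from page 413 for the matching degrees of freedom.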
## ANCOVA (Ch. 11, introduction)

Analysis of covariance: a very light treatment with no math, but a very interesting concept (controlling for and adjusting for confounding variables).

ANCOVA is similar to ANOVA: its assumptions include all of the ANOVA assumptions.

## Confounding variables

In plain English: are we comparing apples to oranges? That is, are there differences in the comparison that we are not properly considering? Is this a fair comparison? Those differences are confounding variables.

## Using ANCOVA (in words)

- Freeze (control for) the variance from lurking (confounding) variables, to study the effect of the variable of interest.
- Example in the written lecture (common in clinical studies): control for back pain (or another clinical endpoint variable) at baseline (the beginning), in order to study the effects of drugs in reducing this back pain.
- Otherwise, the drugs appear to be ineffective, and the people with less back pain at baseline continue to have less back pain at the end.