ENVIRONMENTAL HEALTH

ENVIRONMENTAL HEALTH - STATISTICS OVERVIEW: 4 ASSIGNMENTS

PART 1 (HOMEWORK ASSIGNMENT #1)

HERE ARE 7 RANDOMLY COLLECTED DATA VALUES. WE REFER TO THESE AS “X-VALUES”: 25, 30, 15, 10, 20, 35, 40. THE NUMBER OF DATA POINTS (IN THIS CASE 7) IS REFERRED TO AS “n”.

1) DRAW A SCATTER PLOT OF THESE SEVEN NUMBERS AND LOOK IT OVER. WHAT WOULD YOU GUESSTIMATE THE MEAN (AVERAGE) TO BE? ___25_________

2) NOW, CALCULATE THE MEAN (AVERAGE) OF THESE SEVEN NUMBERS: ____25_____. HOW CLOSE DID YOU COME?

3) FILL IN THE FOLLOWING TABLE TO SEE WHAT THE AVERAGE AND SUBSEQUENTLY THE VARIANCE TELL US. KEEP IN MIND THAT THE (X-M) VALUES CAN BE NEGATIVE AS WELL AS POSITIVE. FOR THE LAST COLUMN DO THE SUBTRACTION FIRST, THEN SQUARE THAT RESULT (SIMPLY SQUARE THE VALUE YOU GOT IN THE SECOND COLUMN).

X-VALUE   X-VALUE – MEAN     (X-VALUE – MEAN)^2
10        10 – 25 = -15      225
15        15 – 25 = -10      100
20        20 – 25 = -5       25
25        25 – 25 = 0        0
30        30 – 25 = 5        25
35        35 – 25 = 10       100
40        40 – 25 = 15       225
TOTAL                        700

WHAT TOTAL DID YOU GET FOR THE SECOND COLUMN (X-VALUE – MEAN): __0____? I ALSO HOPE YOU REMEMBERED THAT WHEN YOU SQUARE A NEGATIVE NUMBER IT BECOMES POSITIVE. THE SUM OF THE VALUES IN THE THIRD COLUMN IS USED TO CALCULATE THE VARIANCE, WHICH IS ESSENTIALLY THE AVERAGE SQUARED DISTANCE OF THE DATA POINTS FROM THE MEAN. THERE IS A WRINKLE IN CALCULATING THIS AVERAGE. WE DIVIDE THE TOTAL OF THE SQUARED DISTANCES (BOTTOM LINE OF THE THIRD COLUMN) BY (n – 1), NOT “n”. SO, IT’S DIVIDED BY (7 – 1) = 6 AND NOT 7. DIVIDE THE TOTAL OF THE THIRD COLUMN BY 6 AND YOU GET: __116.667______. THIS IS THE VARIANCE. SIMPLY TAKE THE SQUARE ROOT OF THAT VARIANCE TO GET THE STANDARD DEVIATION, WHICH IS: __10.8012_______. (FYI: THE (n – 1) IS REFERRED TO AS THE “DEGREES OF FREEDOM” AND IT’S A SIMPLE CONCEPT, BUT NOT ONE WE NEED TO GO INTO HERE.)
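The table and the arithmetic above can be checked with a few lines of Python. This is only a sketch of the hand calculation; the variable names are made up for illustration and are not part of the assignment.

```python
import math

# The seven X-values from Part 1.
x_values = [25, 30, 15, 10, 20, 35, 40]
n = len(x_values)

mean = sum(x_values) / n                    # total of the X-values divided by n
deviations = [x - mean for x in x_values]   # second column: X-value minus mean
squared = [d ** 2 for d in deviations]      # third column: deviation squared

variance = sum(squared) / (n - 1)           # divide by n - 1, not n
sd = math.sqrt(variance)                    # standard deviation

print("mean:", mean)
print("total of second column:", sum(deviations))
print("total of third column:", sum(squared))
print("variance:", round(variance, 3))
print("standard deviation:", round(sd, 4))
```

Changing `n - 1` to `n` in the variance line shows how much the "wrinkle" matters for a data set this small.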

KEEP IN MIND THAT THE STANDARD DEVIATION ITSELF IS NEVER NEGATIVE (WE TAKE THE POSITIVE SQUARE ROOT), BUT IT IS MEASURED IN BOTH DIRECTIONS FROM THE MEAN: ABOVE (+) AND BELOW (-).

BOTTOM LINE: THE GREATER THE DISTANCES OF THE DATA POINTS FROM THE MEAN, THE GREATER THE VARIANCE, AND HENCE THE LESS CONFIDENCE WE CAN HAVE IN THAT MEAN.

“A” “B” “C”
14 87 55
30 88 58
22 69 53
3 75 50
76 70 56
54 91 60
33 60 54
90 79 51
6 100 58
45 68 57

NOW, USING THE THREE SETS OF DATA ABOVE:

4) DRAW A SCATTER PLOT OF EACH ONE (SEPARATE PLOTS, NOT ALL ON ONE GRAPH)

5) CALCULATE THE MEAN FOR EACH DATA SET (FEEL FREE TO USE SOFTWARE FROM HERE ON, NOW THAT YOU KNOW HOW THAT SOFTWARE IS DOING THE CALCULATIONS).

6) BY LOOKING AT THE SCATTER PLOTS, WHICH MEAN WOULD YOU HAVE THE MOST CONFIDENCE IN AND WHY? C, because its points are plotted in a tighter group and are not as spread out as those in A or B.

7) CALCULATE THE VARIANCE AND STANDARD DEVIATION FOR EACH DATA SET. (SOFTWARE IS FINE)

8) WHICH DATA SET HAS THE SMALLEST VARIANCE, AND HENCE THE SMALLEST STANDARD DEVIATION? DOES THIS AGREE WITH YOUR OBSERVATIONAL CHOICE FOR CONFIDENCE IN THE MEAN YOU PICKED? C had the smallest variance, and hence the smallest standard deviation, and it did agree with my choice.
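Since software is allowed from question 5 onward, one possible way to do questions 5, 7, and 8 is with Python's built-in `statistics` module. This is a sketch, not the required method; `statistics.variance` and `statistics.stdev` both divide by (n – 1), matching the hand calculation in question 3. The names `data_sets`, `summary`, and `tightest` are chosen here for illustration.

```python
import statistics

# Data sets A, B, and C from the table above.
data_sets = {
    "A": [14, 30, 22, 3, 76, 54, 33, 90, 6, 45],
    "B": [87, 88, 69, 75, 70, 91, 60, 79, 100, 68],
    "C": [55, 58, 53, 50, 56, 60, 54, 51, 58, 57],
}

summary = {}
for name, values in data_sets.items():
    summary[name] = {
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # sample variance, divides by n - 1
        "sd": statistics.stdev(values),           # square root of the sample variance
    }
    print(name,
          "mean:", round(summary[name]["mean"], 2),
          "variance:", round(summary[name]["variance"], 2),
          "SD:", round(summary[name]["sd"], 2))

# The data set whose points sit closest to their own mean.
tightest = min(summary, key=lambda name: summary[name]["variance"])
print("smallest variance:", tightest)
```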

UNUSUAL DATA OR RARE EVENTS (certain conditions apply, chiefly that the data are roughly bell-shaped, but for a large data set):

· About 68% of those data fall within one SD above and below the mean (mean +/- 1 SD)

· About 95% fall within two SDs above and below the mean (mean +/- 2 SDs)

· About 99.7% fall within three SDs above and below the mean (mean +/- 3 SDs)

As a rule of thumb, an “UNUSUAL” data value (an “outlier”) is considered to be one that is more than TWO SDs from the mean in either direction (+/-).
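The two-SD rule of thumb can be written as a small helper. This is a minimal sketch: the function name `unusual_values` and the `example` numbers are hypothetical and are not one of the assignment's data sets.

```python
import statistics

def unusual_values(values, k=2.0):
    """Return the values that lie more than k sample SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, divides by n - 1
    return [v for v in values if abs(v - mean) > k * sd]

# Made-up numbers for illustration only.
example = [10, 12, 11, 13, 12, 11, 10, 12, 11, 30]
flagged = unusual_values(example)
print("flagged as unusual:", flagged)
```

Note that a single extreme value also inflates the SD it is being compared against, so in small data sets this rule flags fewer points than the eye might expect.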

So, how do you handle outliers? Do you simply delete them? NO – there must be logical and justifiable reasons for doing so, such as lab error, mis-copied numbers, or faulty equipment. The reason(s) for deletion MUST be documented. WHAT IF A PHARMACEUTICAL COMPANY THREW OUT THE ONE-DEATH-IN-1000 DATA VALUE? WANT TO TAKE THAT MEDICATION?

9) SO, ARE ANY OF THE DATA VALUES IN ANY OF THESE THREE DATA SETS POSSIBLY “UNUSUAL” OUTLIERS? LOOK AT THE DATA AND DO THE CALCULATIONS IF YOU FEEL IT IS POSSIBLE. (HINT: THERE IS ONE)

I would say A

PART 2 (HOMEWORK ASSIGNMENT #2)

MEANS COMPARISON: HOW DO WE DETERMINE IF THERE IS A SIGNIFICANT DIFFERENCE BETWEEN TWO (OR MORE) MEANS? FOR EXAMPLE, IN AN EXPERIMENTAL TEST OF A NEW MEDICATION WE HAVE THE CONTROL GROUP, WHICH IS NOT GIVEN THE MEDICATION, AND THE TEST GROUP, WHICH IS. LET’S SAY WE MEASURE THE DAYS A RASH LASTS. OBVIOUSLY, WE WANT FEWER DAYS WITH THE MEDICATION. HERE ARE THE HYPOTHETICAL RESULTS (n = 10 IN EACH GROUP):

CONTROL: 5, 7, 4, 9, 5, 7, 6, 6, 9, 8
EXPERIMENTAL: 4, 5, 6, 7, 6, 8, 8, 7, 8, 5

10) DRAW A SCATTER PLOT OF THESE DATA BUT THIS TIME PUT THEM ON ONE GRAPH (USE A DIFFERENT SYMBOL / COLOR FOR EACH SET SO WE CAN TELL WHICH DATA VALUE GOES WITH WHICH GROUP)

11) CALCULATE THE MEAN FOR EACH GROUP AND THE VARIANCE (V) AND STANDARD DEVIATION (SD)

12) DOES LOOKING AT THIS COMBINED SCATTER PLOT TELL YOU THAT THE MEDICATION WORKED? HOW CONFIDENT ARE YOU THAT THERE IS A DIFFERENCE IN THE TWO MEANS: 50%, 75%, 90%, 95%, 99% ? WHY?

IN STAT 200 YOU WILL LEARN HOW TO CALCULATE A CONFIDENCE INTERVAL (CI) AROUND EACH MEAN. THIS IS A CALCULATED DISTANCE THAT, LIKE A STANDARD DEVIATION, EXTENDS ABOVE AND BELOW THE MEAN. IT IS ALSO TYPICALLY SHORTER THAN ONE STANDARD DEVIATION, AND IT HAS A PROBABILITY FACTOR INVOLVED RELATED TO YOUR CHOSEN CONFIDENCE LEVEL. YOUR “CONFIDENCE” CAN TYPICALLY BE 90%, 95% (MOST COMMONLY USED), OR 99%. THE 99% CONFIDENCE LEVEL IS USED FOR CRITICAL MEASUREMENTS LIKE DRUG TEST RESULTS.

RULE: THE GREATER THE CONFIDENCE LEVEL SOUGHT, THE WIDER THE CONFIDENCE INTERVAL WILL BE: A 99% CI WILL BE WIDER THAN A 90% CI. SO, THE LOWER THE CONFIDENCE LEVEL THE NARROWER THE CI.
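To see the rule in numbers, here is a rough sketch using the CONTROL group. It assumes the common t-based formula (mean +/- t x SD / square root of n), which this handout does not cover, and the three t values are copied from a standard t table for 9 degrees of freedom rather than calculated. Treat it as a preview, not as the Stat 200 method.

```python
import math
import statistics

control = [5, 7, 4, 9, 5, 7, 6, 6, 9, 8]
n = len(control)
mean = statistics.mean(control)
sd = statistics.stdev(control)  # sample SD, divides by n - 1

# Two-sided t critical values for n - 1 = 9 degrees of freedom,
# taken from a standard t table (assumed here, not computed).
t_critical = {"90%": 1.833, "95%": 2.262, "99%": 3.250}

half_widths = {}
for level, t in t_critical.items():
    half = t * sd / math.sqrt(n)   # distance the CI extends above and below the mean
    half_widths[level] = half
    print(level, "CI:", round(mean - half, 2), "to", round(mean + half, 2))
```

The only thing that changes between the three intervals is the t value, which is why a higher confidence level always gives a wider interval for the same data.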

FOR OUR OBSERVATIONAL PURPOSES HERE, INDICATE WHERE THE MEAN IS FOR EACH GROUP AND WHERE THE END POINTS ARE 1 SD ABOVE AND 1 SD BELOW THOSE MEANS. HOW MUCH OVERLAP IS THERE? DOES THIS CHANGE YOUR MIND ON WHETHER THE MEDICATION WORKS OR NOT?
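The mean +/- 1 SD end points, and how much the two ranges overlap, can also be checked with software. This is a sketch; the helper name `one_sd_range` is made up for illustration.

```python
import statistics

control = [5, 7, 4, 9, 5, 7, 6, 6, 9, 8]
experimental = [4, 5, 6, 7, 6, 8, 8, 7, 8, 5]

def one_sd_range(values):
    """Return (mean - 1 SD, mean + 1 SD) using the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - sd, mean + sd

c_low, c_high = one_sd_range(control)
e_low, e_high = one_sd_range(experimental)

# Length of the stretch shared by both ranges (0 if they do not touch).
overlap = max(0.0, min(c_high, e_high) - max(c_low, e_low))

print("control:      ", round(c_low, 2), "to", round(c_high, 2))
print("experimental: ", round(e_low, 2), "to", round(e_high, 2))
print("overlap:      ", round(overlap, 2), "days")
```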

AS I INDICATED AT THE BEGINNING, PROBABILITY PLAYS A CRITICAL ROLE IN STATISTICAL ANALYSIS. THE CONFIDENCE INTERVALS DESCRIBED ABOVE ARE DEFINED BY PROBABILITY. FOR EXAMPLE, A 95% CI MEANS THAT IF WE TOOK MANY SAMPLES AND CALCULATED A CI FROM EACH ONE, ABOUT 95% OF THOSE INTERVALS WOULD CONTAIN THE TRUE MEAN. (IT’S TECHNICALLY NOT THAT THERE IS A 95% PROBABILITY THAT THE TRUE MEAN IS IN THE ONE RANGE WE CALCULATED. THIS IS TYPICALLY A TEST QUESTION ON STAT 200 EXAMS.)

PART 3 (HOMEWORK ASSIGNMENT #3)

FREQUENCY TABLES AND PROBABILITY. UP TO NOW WE HAVE BEEN DEALING WITH SMALL DATA SETS WHERE DRAWING A SCATTER PLOT WAS PHYSICALLY POSSIBLE. BUT, WHAT IF WE HAD 10,000 DATA POINTS INSTEAD OF THE 10 OR SO WE HAVE USED UP TO NOW? HOW DO WE HANDLE THOSE?

VERY SIMPLE: WE PUT THEM INTO GROUPS OR RANGES – FOR EXAMPLE 1 – 10, 11 – 20, 21 – 30, ETC. THE KEY HERE IS THAT THESE GROUPS MUST NOT OVERLAP (NOT 1-10, 10-20, 20-30)

WE DO THIS “GROUPING” IN WHAT IS CALLED A “FREQUENCY TABLE”. IT HAS FOUR COLUMNS:

COLUMN 1: LISTS THE RANGES.

COLUMN 2: IS THE “FREQUENCY” AND IS SIMPLY THE NUMBER OF DATA POINTS IN THAT RANGE (NOT THE ACTUAL DATA VALUES THEMSELVES). THE FREQUENCIES MUST ADD UP TO THE TOTAL NUMBER OF DATA POINTS.

COLUMN 3: IS THE “RELATIVE FREQUENCY”, WHICH IS THE NUMBER OF DATA POINTS IN THAT RANGE DIVIDED BY THE TOTAL NUMBER OF DATA POINTS. FOR EXAMPLE, IF 7 OF 100 DATA POINTS ARE IN A GIVEN RANGE THEN 7/100 OR 7% OF OUR TOTAL DATA POINTS ARE IN THAT RANGE. (7 WOULD HAVE BEEN ENTERED IN COLUMN 2 AS THE FREQUENCY.)

COLUMN 4: IS THE CUMULATIVE RELATIVE FREQUENCY. ALL WE DO HERE IS ADD UP THE RELATIVE FREQUENCIES AS WE GO DOWN THE COLUMN. THE CHECK IS THAT THE CUMULATIVE RELATIVE FREQUENCY ENTERED FOR THE LAST (BOTTOM) RANGE MUST BE 100% OR 1.00.

TRY IT: HERE IS A DATA SET WITH 100 NUMBERS (n = 100):

70 52 89 53 27 39 97 4 91 93
3 37 33 19 68 38 79 61 51 88
64 43 47 63 17 22 72 43 73 23
63 22 79 48 79 65 47 45 12 64
55 55 78 6 18 18 56 99 10 92
28 51 88 31 30 85 83 42 39 94
55 81 70 64 96 58 40 20 85 35
46 76 49 26 49 27 17 76 49 94
18 11 3 63 4 7 57 46 68 61
8 16 68 6 90 49 74 52 8 89

(IF YOU CAN USE EXCEL YOU CAN RANK ORDER (LOW TO HIGH) THESE DATA POINTS, WHICH HELPS)

13) FREQUENCY TABLE: (FILL IT OUT)

RANGES FREQUENCY RELATIVE FREQUENCY CUMULATIVE RF
1 – 10
11 – 20
21 – 30
31 – 40
41 – 50
51 – 60
61 – 70
71 – 80
81 – 90
91 – 100
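
If you would rather not count by hand, the four columns can be filled in with a short script. This is one possible sketch, not the required method: it assumes every value is a whole number from 1 to 100, so that (value - 1) // 10 picks the right range, and the names `frequencies` and `rows` are chosen here for illustration.

```python
# The 100 data points from the TRY IT data set, row by row.
data = [
    70, 52, 89, 53, 27, 39, 97, 4, 91, 93,
    3, 37, 33, 19, 68, 38, 79, 61, 51, 88,
    64, 43, 47, 63, 17, 22, 72, 43, 73, 23,
    63, 22, 79, 48, 79, 65, 47, 45, 12, 64,
    55, 55, 78, 6, 18, 18, 56, 99, 10, 92,
    28, 51, 88, 31, 30, 85, 83, 42, 39, 94,
    55, 81, 70, 64, 96, 58, 40, 20, 85, 35,
    46, 76, 49, 26, 49, 27, 17, 76, 49, 94,
    18, 11, 3, 63, 4, 7, 57, 46, 68, 61,
    8, 16, 68, 6, 90, 49, 74, 52, 8, 89,
]
n = len(data)

# Column 2: count how many data points land in each non-overlapping range.
frequencies = [0] * 10
for value in data:
    frequencies[(value - 1) // 10] += 1   # 1-10 -> slot 0, 11-20 -> slot 1, ...

# Columns 3 and 4: relative frequency and its running total.
rows = []
cumulative = 0.0
for i, freq in enumerate(frequencies):
    relative = freq / n
    cumulative += relative
    label = f"{i * 10 + 1} - {(i + 1) * 10}"
    rows.append((label, freq, relative, cumulative))
    print(f"{label:<10}{freq:>5}{relative:>8.2f}{cumulative:>8.2f}")
```

The two built-in checks from the column descriptions apply here as well: the frequencies should add up to n, and the last cumulative value should be 1.00.
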
SO, WHAT DOES A FREQUENCY TABLE TELL US?
IF YOU WROTE EACH NUMBER ON A PING-PONG BALL AND PUT ALL 100 OF THEM INTO A BUCKET AND DREW ONE OUT BLINDLY:
14) WHAT IS THE PROBABILITY THAT BALL WILL HAVE A NUMBER BETWEEN 11 AND 20?

15) WHAT IS THE PROBABILITY THE NUMBER IS BETWEEN 31 AND 70 (INCLUSIVE)?
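
The link between the frequency table and probability can be sketched in code: the chance of drawing a ball from a range is that range's frequency divided by the total. The frequencies below are the ten values listed with this handout's frequency charts; the function name `probability` is made up for illustration, and it only accepts limits that line up with the table's range end points.

```python
# Ranges 1-10, 11-20, ..., 91-100 and the frequency of each.
ranges = [(i * 10 + 1, (i + 1) * 10) for i in range(10)]
frequencies = [10, 10, 8, 8, 13, 11, 14, 9, 9, 8]
total = sum(frequencies)

def probability(low, high):
    """Chance that a blindly drawn number falls in low..high (inclusive).

    low and high must match range end points, such as 11 and 20 or 31 and 70.
    """
    count = sum(f for (lo, hi), f in zip(ranges, frequencies)
                if lo >= low and hi <= high)
    return count / total

print("P(11 to 20):", probability(11, 20))
print("P(31 to 70):", probability(31, 70))
```

Adding the relative frequencies of the ranges involved gives the same answer, which is all the function is doing.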

WE USE THE FREQUENCY OR RELATIVE FREQUENCY VALUES TO CREATE A DESCRIPTIVE HISTOGRAM (OR BAR CHART):

The BAR CHART is for discrete data (2, 5, 7, 9), whereas the HISTOGRAM is for continuous data (1.23, 3.4, 5.6). If you did your Frequency or Relative Frequency columns correctly, the plots labeled BAR CHART OF FREQUENCY and HISTOGRAM OF FREQUENCY show what yours should look like.

PART 4 (HOMEWORK ASSIGNMENT #4)

With REGRESSION and CORRELATION we have TWO variables: an independent variable and dependent variable. In the case of our RASH MEDICATION (Part 2) the INDEPENDENT variable would be the dosage and the DEPENDENT variable is of course the days the rash lasts. We will assume that we can buy different dosages (in milligrams) of that medication. Here are some made up results:

DOSAGE (MG)   DAYS TO RELIEF
1             8
1.25          6
1.5           5.5
1.75          5
2             4.5
2.5           4
3             3.9 (BURNING)
4             3.8 (BURNING)
5             3.9 (BURNING)

16) DRAW A SCATTER PLOT OF THESE TWO VARIABLES: THE INDEPENDENT VARIABLE (DOSAGE) GOES ALONG THE HORIZONTAL X-AXIS AND THE DEPENDENT VARIABLE (DAYS TO RELIEF) GOES UP THE VERTICAL Y-AXIS.

NOW WHAT? LOOK AT THE PLOT. SEE A PATTERN LIKE ANYTHING BELOW? (The example patterns are from Statistics Using Technology by Kathryn Kozak, available under a Creative Commons Attribution-ShareAlike 3.0 United States license.)

17) TWO QUESTIONS WE WANT TO ANSWER ARE: “Is there a relationship between two variables?” (REGRESSION) and “How strong is that relationship?” (CORRELATION). SO, WHAT IS YOUR ANSWER FOR THE RASH MEDICATION? DOES THE DOSAGE AFFECT TIME TO RELIEF, OR NOT? ANY OTHER OBSERVATIONS BASED ON THE DATA OR SCATTER PLOT? WOULD YOU PUT ANY “CAUTIONS” ON THE MEDICATION’S PACKAGE?

An example of a NEGATIVE correlation would be vehicle speed versus mpg: the faster your speed, the lower your gas mileage. For a NON-LINEAR correlation example, imagine the “smile” in the example above was a “frown”. This is typical of polymer added to water or wastewater to remove suspended solids: removal improves as we increase the polymer dosage, but then it peaks, and after that the more polymer we add the worse the solids removal.
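
One common way to put a number on the direction and strength of a straight-line relationship is the correlation coefficient r, together with a least-squares line. The sketch below applies the standard textbook formulas to the dosage data; this handout does not cover those formulas, so treat the code as a preview under that assumption. The variable names are chosen here for illustration.

```python
import math

dosage_mg = [1, 1.25, 1.5, 1.75, 2, 2.5, 3, 4, 5]
days_to_relief = [8, 6, 5.5, 5, 4.5, 4, 3.9, 3.8, 3.9]

n = len(dosage_mg)
mean_x = sum(dosage_mg) / n
mean_y = sum(days_to_relief) / n

# Sums of products of deviations from the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage_mg, days_to_relief))
s_xx = sum((x - mean_x) ** 2 for x in dosage_mg)
s_yy = sum((y - mean_y) ** 2 for y in days_to_relief)

r = s_xy / math.sqrt(s_xx * s_yy)      # correlation coefficient, between -1 and +1
slope = s_xy / s_xx                    # least-squares line: days = intercept + slope * dosage
intercept = mean_y - slope * mean_x

print("r:", round(r, 3))
print("line: days =", round(intercept, 2), "+", round(slope, 2), "* dosage")
```

A negative r says days to relief go down as dosage goes up. Because r only measures how well a straight line fits, it does not show that the benefit levels off at the higher dosages, which is why the scatter plot still needs to be looked at.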

BUT ALWAYS KEEP IN MIND: A CORRELATION DOES NOT PROVE AN ACTUAL RELATIONSHIP BY ITSELF – IT MERELY SUPPORTS OTHER SOUND SCIENCE OBSERVATIONS OR DATA.

In Stat 200 you would learn how to actually quantify the strength of the correlation and develop an equation to calculate the possible relationship for other dosages. THAT’S ALL FOLKS!!

[FIGURES: BAR CHART OF FREQUENCY and HISTOGRAM OF FREQUENCY. Both plot the same ten frequencies, one per range: 10, 10, 8, 8, 13, 11, 14, 9, 9, 8]