What percentage of patients had irritable bowel syndrome as their primary diagnosis?
Details:
Use MS Word to complete “Questions to be Graded: Exercise 27” in Statistics for Nursing Research: A Workbook for Evidence-Based Practice. Run the analyses in SPSS, then copy the output and paste it into the Word document. In addition to the SPSS output, please include explanations of the results where appropriate.
chapter 27
Calculating Descriptive Statistics
There are two major classes of statistics: descriptive statistics and inferential statistics.
Descriptive statistics are computed to reveal characteristics of the sample data set and to
describe study variables. Inferential statistics are computed to gain information about
effects and associations in the population being studied. For some types of studies,
descriptive statistics will be the only approach to analysis of the data. For other studies,
descriptive statistics are the first step in the data analysis process, to be followed by inferential
statistics. For all studies that involve numerical data, descriptive statistics are
crucial in understanding the fundamental properties of the variables being studied. Exercise
27 focuses only on descriptive statistics and will illustrate the most common descriptive
statistics computed in nursing research and provide examples using actual clinical
data from empirical publications.
MEASURES OF CENTRAL TENDENCY
A measure of central tendency is a statistic that represents the center or middle of a
frequency distribution. The three measures of central tendency commonly used in nursing
research are the mode, median (MD), and mean ($\bar{X}$). The mean is the arithmetic average
of all of a variable's values. The median is the exact middle value (or the average of the
middle two values if there is an even number of observations). The mode is the most
commonly occurring value or values (see Exercise 8).
The following data have been collected from veterans with rheumatoid arthritis (Tran,
Hooker, Cipher, & Reimold, 2009). The values in Table 27-1 were extracted from a larger
sample of veterans who had a history of biologic medication use (e.g., infliximab [Remicade],
etanercept [Enbrel]). Table 27-1 contains data collected from 10 veterans who had
stopped taking biologic medications, and the variable represents the number of years that
each veteran had taken the medication before stopping.
Because the number of study subjects represented below is 10, the correct statistical
notation to reflect that number is:
$$n = 10$$
Note that the n is lowercase, because we are referring to a sample of veterans. If the
data being presented represented the entire population of veterans, the correct notation
is the uppercase N. Because most nursing research is conducted using samples, not populations,
all formulas in the subsequent exercises will incorporate the sample notation, n.
Mode
The mode is the numerical value or score that occurs with the greatest frequency; it does
not necessarily indicate the center of the data set. The data in Table 27-1 contain two
Copyright © 2017, Elsevier Inc. All rights reserved.
modes: 1.5 and 3.0. Each of these numbers occurred twice in the data set. When two
modes exist, the data set is referred to as bimodal; a data set that contains more than
two modes would be multimodal.
Median
The median (MD) is the score at the exact center of the ungrouped frequency distribution.
It is the 50th percentile. To obtain the MD, sort the values from lowest to highest. If the
number of values is an odd number, exactly 50% of the values are above the MD and
50% are below it. If the number of values is an even number, the MD is the average of the
two middle values. Thus the MD may not be an actual value in the data set. For example,
the data in Table 27-1 consist of 10 observations, and therefore the MD is calculated as
the average of the two middle values.
$$MD = \frac{1.5 + 2.0}{2} = 1.75$$
Mean
The most commonly reported measure of central tendency is the mean. The mean is the
sum of the scores divided by the number of scores being summed. Thus like the MD, the
mean may not be a member of the data set. The formula for calculating the mean is as
follows:
$$\bar{X} = \frac{\sum X}{n}$$
where
$\bar{X}$ = mean
$\Sigma$ = sigma, the statistical symbol for summation
$X$ = a single value in the sample
$n$ = total number of values in the sample
The mean number of years that the veterans used a biologic medication is calculated
as follows:
$$\bar{X} = \frac{0.1 + 0.3 + 1.3 + 1.5 + 1.5 + 2.0 + 2.2 + 3.0 + 3.0 + 4.0}{10} = 1.9 \text{ years}$$
TABLE 27-1 DURATION OF BIOLOGIC USE AMONG VETERANS WITH RHEUMATOID ARTHRITIS (n = 10)
Duration of Biologic Use (years)
0.1
0.3
1.3
1.5
1.5
2.0
2.2
3.0
3.0
4.0
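The mode, median, and mean of the Table 27-1 values can also be checked programmatically. The following is a minimal sketch in Python, assuming only the standard-library `statistics` module; the variable names are illustrative and are not part of the workbook.

```python
from statistics import mean, median, multimode

# Duration of biologic use in years, copied from Table 27-1 (n = 10).
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

n = len(durations)
modes = multimode(durations)  # every value tied for the highest frequency
md = median(durations)        # average of the two middle values when n is even
x_bar = mean(durations)       # sum of the values divided by n

print(f"n = {n}")
print(f"mode(s) = {modes}")
print(f"median = {md}")
print(f"mean = {x_bar:.2f}")
```

Because `multimode` returns every value tied for the highest frequency, a bimodal data set such as this one yields a list with two entries rather than a single mode.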
The mean is an appropriate measure of central tendency for approximately normally
distributed populations with variables measured at the interval or ratio level. It is also
appropriate for ordinal level data such as Likert scale values, where higher numbers represent
more of the construct being measured and lower numbers represent less of the
construct (such as pain levels, patient satisfaction, depression, and health status).
The mean is sensitive to extreme scores such as outliers. An outlier is a value in a
sample data set that is unusually low or unusually high in the context of the rest of the
sample data. An example of an outlier in the data presented in Table 27-1 might be a
value such as 11. The existing values range from 0.1 to 4.0, meaning that no veteran used
a biologic beyond 4 years. If an additional veteran were added to the sample and that
person used a biologic for 11 years, the mean would be much larger: 2.7 years. Simply
adding this outlier to the sample increased the mean by more than 40%. The outlier would also
change the frequency distribution. Without the outlier, the frequency distribution is
approximately normal, as shown in Figure 27-1. Including the outlier changes the shape
of the distribution to appear positively skewed.
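The sensitivity of the mean to an outlier can be illustrated with a short sketch. It assumes the hypothetical 11-year value described above is appended to the Table 27-1 data; the variable names are illustrative only.

```python
from statistics import mean, median

# Duration of biologic use in years, copied from Table 27-1 (n = 10).
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

# Hypothetical additional veteran who used a biologic for 11 years.
with_outlier = durations + [11.0]

mean_without = mean(durations)
mean_with = mean(with_outlier)
median_without = median(durations)
median_with = median(with_outlier)

# Relative change in the mean caused by the single extreme value.
pct_increase = (mean_with - mean_without) / mean_without * 100

print(f"mean:   {mean_without:.2f} -> {mean_with:.2f} ({pct_increase:.0f}% higher)")
print(f"median: {median_without:.2f} -> {median_with:.2f}")
```

The comparison shows the mean shifting far more than the median when one extreme value is added, which is why the median is often preferred for skewed data.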
Although the use of summary statistics has been the traditional approach to describing
data or describing the characteristics of the sample before inferential statistical analysis,
its ability to clarify the nature of data is limited. For example, using measures of central
tendency, particularly the mean, to describe the nature of the data obscures the impact
of extreme values or deviations in the data. Thus, significant features in the data may be
concealed or misrepresented. Often, anomalous, unexpected, or problematic data and
discrepant patterns are evident, but are not regarded as meaningful. Measures of dispersion,
such as the range, difference scores, variance, and standard deviation (SD), provide
important insight into the nature of the data.
MEASURES OF DISPERSION
Measures of dispersion, or variability, are measures of individual differences of the
members of the population and sample. They indicate how values in a sample are dispersed
around the mean. These measures provide information about the data that is not
available from measures of central tendency. They indicate how different the scores are—
the extent to which individual values deviate from one another. If the individual values
are similar, measures of variability are small and the sample is relatively homogeneous
in terms of those values. Heterogeneity (wide variation in scores) is important in some
statistical procedures, such as correlation. Heterogeneity is determined by measures of
variability. The measures most commonly used are range, difference scores, variance, and
SD (see Exercise 9).
FIGURE 27-1 ■ FREQUENCY DISTRIBUTION OF YEARS OF BIOLOGIC USE, WITHOUT OUTLIER AND WITH OUTLIER.
[Two histograms. x-axis: years of biologic use, in 1-year bins; y-axis: frequency.]
Range
The simplest measure of dispersion is the range. In published studies, range is presented
in two ways: (1) the range is the lowest and highest scores, or (2) the range is calculated
by subtracting the lowest score from the highest score. The range for the scores in Table
27-1 is 0.1 and 4.0, or it can be calculated as follows: 4.0 − 0.1 = 3.9. In this form, the
range is a difference score that uses only the two extreme scores for the comparison. The
range is generally reported but is not used in further analyses.
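Both ways of reporting the range can be reproduced from the lowest (0.1) and highest (4.0) values listed in Table 27-1. This is a small sketch using Python built-ins; the variable names are illustrative.

```python
# Duration of biologic use in years, copied from Table 27-1 (n = 10).
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

lowest = min(durations)
highest = max(durations)
value_range = highest - lowest  # range expressed as a single difference score

print(f"range reported as limits: {lowest} to {highest}")
print(f"range reported as a difference: {value_range:.1f}")
```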
Difference Scores
Difference scores are obtained by subtracting the mean from each score. Sometimes a
difference score is referred to as a deviation score because it indicates the extent to which
a score deviates from the mean. Of course, most variables in nursing research are not
“scores,” yet the term difference score is used to represent a value's deviation from the
mean. The difference score is positive when the score is above the mean, and it is negative
when the score is below the mean (see Table 27-2). Difference scores are the basis for many
statistical analyses and can be found within many statistical equations. The formula for
difference scores is:
$$X - \bar{X}$$
TABLE 27-2 DIFFERENCE SCORES OF DURATION OF BIOLOGIC USE
$X - \bar{X}$      $X - \bar{X}$
0.1 − 1.9          −1.8
0.3 − 1.9          −1.6
1.3 − 1.9          −0.6
1.5 − 1.9          −0.4
1.5 − 1.9          −0.4
2.0 − 1.9          0.1
2.2 − 1.9          0.3
3.0 − 1.9          1.1
3.0 − 1.9          1.1
4.0 − 1.9          2.1
Σ of absolute values: 9.5
The mean deviation is the average difference score, using the absolute values. The
formula for the mean deviation is:
$$\bar{X}_{\text{deviation}} = \frac{\sum |X - \bar{X}|}{n}$$
In this example, the mean deviation is 0.95. This value was calculated by taking the
sum of the absolute value of each difference score (1.8, 1.6, 0.6, 0.4, 0.4, 0.1, 0.3, 1.1, 1.1,
2.1) and dividing by 10. The result indicates that, on average, subjects' duration of biologic
use deviated from the mean by 0.95 years.
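The difference scores and the mean deviation can be reproduced with a few lines of Python. This sketch assumes the mean rounded to one decimal place (1.9), as used in Table 27-2; the variable names are illustrative.

```python
# Duration of biologic use in years, copied from Table 27-1 (n = 10).
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]
x_bar = 1.9  # mean rounded to one decimal place, as in Table 27-2

# Difference (deviation) score for each value: X minus the mean.
# Rounding to one decimal removes floating-point noise such as -1.7999999.
difference_scores = [round(x - x_bar, 1) for x in durations]

# Mean deviation: average of the absolute difference scores.
sum_abs = sum(abs(d) for d in difference_scores)
mean_deviation = sum_abs / len(durations)

print(difference_scores)
print(f"sum of absolute values = {sum_abs:.1f}")
print(f"mean deviation = {mean_deviation:.2f}")
```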
Variance
Variance is another measure commonly used in statistical analysis. The equation for a
sample variance ($s^2$) is below.
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
Note that the lowercase letter $s^2$ is used to represent a sample variance. The lowercase
Greek sigma ($\sigma^2$) is used to represent a population variance, in which the denominator is
N instead of n − 1. Because most nursing research is conducted using samples, not populations,
formulas in the subsequent exercises that contain a variance or standard deviation
will incorporate the sample notation, using n − 1 as the denominator. Moreover, statistical
software packages such as SPSS compute the variance and standard deviation using the sample
formulas, not the population formulas.
The variance is always a positive value and has no upper limit. In general, the larger the
variance, the larger the dispersion of scores. The variance is most often computed to derive
the standard deviation because, unlike the variance, the standard deviation reflects important
properties about the frequency distribution of the variable it represents. Table 27-3
displays how we would compute a variance by hand, using the biologic duration data.
$$s^2 = \frac{13.41}{9}$$
$$s^2 = 1.49$$
TABLE 27-3 VARIANCE COMPUTATION OF BIOLOGIC USE
$X$      $\bar{X}$      $X - \bar{X}$      $(X - \bar{X})^2$
0.1      1.9            −1.8               3.24
0.3      1.9            −1.6               2.56
1.3      1.9            −0.6               0.36
1.5      1.9            −0.4               0.16
1.5      1.9            −0.4               0.16
2.0      1.9            0.1                0.01
2.2      1.9            0.3                0.09
3.0      1.9            1.1                1.21
3.0      1.9            1.1                1.21
4.0      1.9            2.1                4.41
Σ                                          13.41
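The hand computation in Table 27-3 can be compared with a library result. This is a sketch under two assumptions: the hand version uses the rounded mean of 1.9, as the table does, and the library version uses Python's standard-library `statistics.variance` (n − 1 denominator) and `statistics.pvariance` (N denominator). The variable names are illustrative.

```python
from statistics import pvariance, variance

# Duration of biologic use in years, copied from Table 27-1 (n = 10).
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]
x_bar = 1.9  # rounded mean, as used in Table 27-3

# Hand computation: sum of squared difference scores divided by n - 1.
sum_sq = sum((x - x_bar) ** 2 for x in durations)
s2_by_hand = sum_sq / (len(durations) - 1)

# Library computation with the unrounded mean.
s2 = variance(durations)       # sample variance, denominator n - 1
sigma2 = pvariance(durations)  # population variance, denominator N

print(f"sum of squares = {sum_sq:.2f}")
print(f"s^2 by hand = {s2_by_hand:.2f}")
print(f"s^2 (sample formula) = {s2:.2f}")
print(f"sigma^2 (population formula) = {sigma2:.2f}")
```

The population formula gives a smaller value for the same data because it divides by N rather than n − 1.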
Standard Deviation
Standard deviation is a measure of dispersion that is the square root of the variance.
The standard deviation is represented by the notation s or SD. The equation for obtaining
a standard deviation is
$$SD = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
Table 27-3 displays the computations for the variance. To compute the SD, simply take
the square root of the variance. We know that the variance of biologic duration is
$s^2$ = 1.49. Therefore, the s of biologic duration is SD = 1.22. The SD is an important statistic,
both for understanding dispersion within a distribution and for interpreting the
relationship of a particular value to the distribution.
SAMPLING ERROR
A standard error describes the extent of sampling error. For example, a standard error
of the mean is calculated to determine the magnitude of the variability associated with
the mean. A small standard error is an indication that the sample mean is close to
the population mean, while a large standard error yields less certainty that the sample
mean approximates the population mean. The formula for the standard error of the mean
($s_{\bar{X}}$) is:
$$s_{\bar{X}} = \frac{s}{\sqrt{n}}$$
Using the biologic medication duration data, we know that the standard deviation of
biologic duration is s = 1.22. Therefore, the standard error of the mean for biologic duration
is computed as follows:
$$s_{\bar{X}} = \frac{1.22}{\sqrt{10}}$$
$$s_{\bar{X}} = 0.39$$
The standard error of the mean for biologic duration is 0.39.
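The standard deviation and the standard error of the mean can be checked as follows. This is a minimal sketch assuming the standard-library `statistics.stdev`, which uses the sample (n − 1) formula; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

# Duration of biologic use in years, copied from Table 27-1 (n = 10).
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

sd = stdev(durations)                # square root of the sample variance
se_mean = sd / sqrt(len(durations))  # standard error of the mean: s / sqrt(n)

print(f"SD = {sd:.2f}")
print(f"standard error of the mean = {se_mean:.2f}")
```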
Confidence Intervals
To determine how closely the sample mean approximates the population mean, the standard
error of the mean is used to build a confidence interval. For that matter, a confidence
interval can be created for many statistics, such as a mean, proportion, and odds ratio.
To build a confidence interval around a statistic, you must have the standard error value
and the t value to adjust the standard error. The degrees of freedom (df) to use to compute
a confidence interval is df = n − 1.
To compute the confidence interval for a mean, the lower and upper limits of that
interval are created by multiplying the $s_{\bar{X}}$ by the t statistic, where df = n − 1, and then
subtracting the product from the mean (lower limit) and adding it to the mean (upper limit). For a 95%
confidence interval, the t value should be selected at α = 0.05. For a 99% confidence interval,
the t value should be selected at α = 0.01.
Using the biologic medication duration data, we know that the standard error of the
mean duration of biologic medication use is $s_{\bar{X}} = 0.39$. The mean duration of biologic
medication use is 1.89. Therefore, the 95% confidence interval for the mean duration of
biologic medication use is computed as follows:
$$\bar{X} \pm s_{\bar{X}} t$$
$$1.89 \pm (0.39)(2.26)$$
$$1.89 \pm 0.88$$
As referenced in Appendix A, the t value required for the 95% confidence interval with
df = 9 is 2.26. The computation above results in a lower limit of 1.01 and an upper limit
of 2.77. This means that our confidence interval of 1.01 to 2.77 estimates the population
mean duration of biologic use with 95% confidence (Kline, 2004). Technically and mathematically,
it means that if we drew an infinite number of samples of veterans and computed a 95%
confidence interval for the mean duration of biologic medication use from each sample,
95% of the intervals would contain the true
population mean, and 5% would not contain the population mean (Gliner, Morgan, &
Leech, 2009). If we were to compute a 99% confidence interval, we would require the t
value that is referenced at α = 0.01. Therefore, the 99% confidence interval for the mean
duration of biologic medication use is computed as follows:
$$1.89 \pm (0.39)(3.25)$$
$$1.89 \pm 1.27$$
As referenced in Appendix A, the t value required for the 99% confidence interval with
df = 9 is 3.25. The computation above results in a lower limit of 0.62 and an upper limit
of 3.16. This means that our confidence interval of 0.62 to 3.16 estimates the population
mean duration of biologic use with 99% confidence.
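The two confidence intervals above can be reproduced with a short helper. This is a sketch, not a general-purpose routine: the function name `mean_confidence_interval` is hypothetical, and the inputs are the rounded values given in the text (mean 1.89, standard error 0.39, and the t values 2.26 and 3.25 for df = 9).

```python
def mean_confidence_interval(x_bar, se, t_crit):
    """Return (lower, upper) limits computed as x_bar -/+ se * t_crit."""
    margin = se * t_crit
    return x_bar - margin, x_bar + margin


# Rounded values taken from the text.
x_bar = 1.89  # mean duration of biologic use (years)
se = 0.39     # standard error of the mean

ci95 = mean_confidence_interval(x_bar, se, t_crit=2.26)  # t for alpha = 0.05, df = 9
ci99 = mean_confidence_interval(x_bar, se, t_crit=3.25)  # t for alpha = 0.01, df = 9

print(f"95% CI: {ci95[0]:.2f} to {ci95[1]:.2f}")
print(f"99% CI: {ci99[0]:.2f} to {ci99[1]:.2f}")
```

Software that carries unrounded values for the standard error and t statistic can report slightly different limits; the SPSS output for these data lists 1.017 and 2.763 for the 95% interval.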
Degrees of Freedom
The concept of degrees of freedom (df) was used in reference to computing a confidence
interval. For any statistical computation, degrees of freedom are the number of independent
pieces of information that are free to vary in order to estimate another piece of
information (Zar, 2010). In the case of the confidence interval, the degrees of freedom are
n − 1. This means that there are n − 1 independent observations in the sample that are
free to vary (to be any value) to estimate the lower and upper limits of the confidence
interval.
SPSS COMPUTATIONS
A retrospective descriptive study examined the duration of biologic use from veterans
with rheumatoid arthritis (Tran et al., 2009). The values in Table 27-4 were extracted from
a larger sample of veterans who had a history of biologic medication use (e.g., infliximab
[Remicade], etanercept [Enbrel]). Table 27-4 contains simulated demographic data collected
from 10 veterans who had stopped taking biologic medications. Age at study enrollment,
duration of biologic use, race/ethnicity, gender (F = female), tobacco use (F = former
use, C = current use, N = never used), primary diagnosis (3 = irritable bowel syndrome, 4
= psoriatic arthritis, 5 = rheumatoid arthritis, 6 = reactive arthritis), and type of biologic
medication used were among the study variables examined.
TABLE 27-4 DEMOGRAPHIC VARIABLES OF VETERANS WITH RHEUMATOID ARTHRITIS
Patient ID   Duration (yrs)   Age   Race/Ethnicity                  Gender   Tobacco   Diagnosis   Biologic
1            0.1              42    Caucasian                       F        F         5           Infliximab
2            0.3              41    Black, not of Hispanic Origin   F        F         5           Etanercept
3            1.3              56    Caucasian                       F        N         5           Infliximab
4            1.5              78    Caucasian                       F        F         3           Infliximab
5            1.5              86    Black, not of Hispanic Origin   F        F         4           Etanercept
6            2.0              49    Caucasian                       F        F         6           Etanercept
7            2.2              82    Caucasian                       F        F         5           Infliximab
8            3.0              35    Caucasian                       F        N         3           Infliximab
9            3.0              59    Black, not of Hispanic Origin   F        C         3           Infliximab
10           4.0              37    Caucasian                       F        F         5           Etanercept
This is how our data set looks in SPSS.
Step 1: For a nominal variable, the appropriate descriptive statistics are frequencies and
percentages. From the “Analyze” menu, choose “Descriptive Statistics” and “Frequencies.”
Move “Race/Ethnicity” and “Gender” over to the right. Click “OK.”
Step 2: For a continuous variable, the appropriate descriptive statistics are means and
standard deviations. From the “Analyze” menu, choose “Descriptive Statistics” and
“Explore.” Move “Duration” over to the right. Click “OK.”
INTERPRETATION OF SPSS OUTPUT
The following tables are generated from SPSS. The first set of tables (from the first set of
SPSS commands in Step 1) contains the frequencies of race/ethnicity and gender. Most
(70%) were Caucasian, and 100% were female.
Frequencies
Frequency Table
RaceEthnicity
                                        Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Black, not of Hispanic Origin   3           30.0      30.0            30.0
        Caucasian                       7           70.0      70.0            100.0
        Total                           10          100.0     100.0
Gender
                                        Frequency   Percent   Valid Percent   Cumulative Percent
Valid   F                               10          100.0     100.0           100.0
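The frequencies and percentages in these tables can be reproduced outside SPSS. The following is a minimal sketch using the standard-library `collections.Counter`; the lists are transcribed from the Race/Ethnicity and Gender columns of Table 27-4, and the helper name `frequency_table` is hypothetical.

```python
from collections import Counter

# Race/Ethnicity and Gender columns transcribed from Table 27-4 (patients 1-10).
race = [
    "Caucasian", "Black, not of Hispanic Origin", "Caucasian", "Caucasian",
    "Black, not of Hispanic Origin", "Caucasian", "Caucasian", "Caucasian",
    "Black, not of Hispanic Origin", "Caucasian",
]
gender = ["F"] * 10


def frequency_table(values):
    """Map each category to a (frequency, percent) pair, sorted by category."""
    counts = Counter(values)
    total = len(values)
    return {cat: (count, 100 * count / total) for cat, count in sorted(counts.items())}


race_table = frequency_table(race)
gender_table = frequency_table(gender)

for category, (freq, pct) in race_table.items():
    print(f"{category}: {freq} ({pct:.1f}%)")
for category, (freq, pct) in gender_table.items():
    print(f"{category}: {freq} ({pct:.1f}%)")
```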
The second set of output (from the second set of SPSS commands in Step 2) contains the
descriptive statistics for “Duration,” including the mean, s (standard deviation), SE, 95%
confidence interval for the mean, median, variance, minimum value, maximum value,
range, and skewness and kurtosis statistics. As shown in the output, mean number of
years for duration is 1.89, and the SD is 1.22. The 95% CI is 1.02–2.76.
Explore
Descriptives
Duration of Biologic Use                             Statistic   Std. Error
Mean                                                 1.890       .3860
95% Confidence Interval for Mean   Lower Bound       1.017
                                   Upper Bound       2.763
5% Trimmed Mean                                      1.872
Median                                               1.750
Variance                                             1.490
Std. Deviation                                       1.2206
Minimum                                              .1
Maximum                                              4.0
Range                                                3.9
Interquartile Range                                  2.0
Skewness                                             .159        .687
Kurtosis                                             -.437       1.334
STUDY QUESTIONS
1. Define mean.
2. What does this symbol, $s^2$, represent?
3. Define outlier.
4. Are there any outliers among the values representing duration of biologic use?
5. How would you interpret the 95% confidence interval for the mean of duration of biologic use?
6. What percentage of patients were Black, not of Hispanic origin?
7. Can you compute the variance for duration of biologic use by using the information presented
in the SPSS output above?
8. Plot the frequency distribution of duration of biologic use.
9. Where is the median in relation to the mean in the frequency distribution of duration of biologic
use?
10. When would a median be more informative than a mean in describing a variable?
Answers to Study Questions
1. The mean is defined as the arithmetic average of a set of numbers.
2. $s^2$ represents the sample variance of a given variable.
3. An outlier is a value in a sample data set that is unusually low or unusually high in the context
of the rest of the sample data.
4. There are no outliers among the values representing duration of biologic use.
5. The 95% CI is 1.02–2.76, meaning that our confidence interval of 1.02–2.76 estimates the
population mean duration of biologic use with 95% confidence.
6. 30% of patients were Black, not of Hispanic origin.
7. Yes, the variance for duration of biologic use can be computed by squaring the SD presented
in the SPSS table. The SD is listed as 1.22, and, therefore, the variance is $1.22^2$ or 1.49.
8. The frequency distribution approximates the following plot:
[Histogram. x-axis: duration of biologic use (0 to 5.0); y-axis: frequency (0 to 3). Mean = 1.89, Std. Dev. = 1.221, N = 10.]
9. The median is 1.75 and the mean is 1.89. Therefore, the median is lower in relation to the
mean in the frequency distribution of duration of biologic use.
10. A median can be more informative than a mean in describing a variable when the variable's
frequency distribution is positively or negatively skewed. While the mean is sensitive to outliers,
the median is relatively unaffected.
Questions to Be Graded EXERCISE 27
Follow your instructor's directions to submit your answers to the following questions for grading.
Your instructor may ask you to write your answers below and submit them as a hard copy for
grading. Alternatively, your instructor may ask you to use the space below for notes and submit your
answers online at http://evolve.elsevier.com/Grove/statistics/ under “Questions to Be Graded.”
1. What is the mean age of the sample data?
2. What percentage of patients never used tobacco?
3. What is the standard deviation for age?
4. Are there outliers among the values of age? Provide a rationale for your answer.
5. What is the range of age values?
6. What percentage of patients were taking infliximab?
7. What percentage of patients had rheumatoid arthritis as their primary diagnosis?
8. What percentage of patients had irritable bowel syndrome as their primary diagnosis?
9. What is the 95% CI for age?
10. What percentage of patients had psoriatic arthritis as their primary diagnosis?