# What percentage of patients had irritable bowel syndrome as their primary diagnosis?


Details:

Use MS Word to complete “Questions to be Graded: Exercise 27” in Statistics for Nursing Research: A Workbook for Evidence-Based Practice. Submit your work in SPSS by copying the output and pasting into the Word document. In addition to the SPSS output, please include explanations of the results where appropriate.

chapter 27

Calculating Descriptive Statistics

There are two major classes of statistics: descriptive statistics and inferential statistics.

Descriptive statistics are computed to reveal characteristics of the sample data set and to

describe study variables. Inferential statistics are computed to gain information about

effects and associations in the population being studied. For some types of studies,

descriptive statistics will be the only approach to analysis of the data. For other studies,

descriptive statistics are the first step in the data analysis process, to be followed by inferential

statistics. For all studies that involve numerical data, descriptive statistics are

crucial in understanding the fundamental properties of the variables being studied. Exercise

27 focuses only on descriptive statistics and will illustrate the most common descriptive

statistics computed in nursing research and provide examples using actual clinical

data from empirical publications.

MEASURES OF CENTRAL TENDENCY

A measure of central tendency is a statistic that represents the center or middle of a

frequency distribution. The three measures of central tendency commonly used in nursing

research are the mode, median (MD), and mean ($\bar{X}$). The mean is the arithmetic average

of all of a variable's values. The median is the exact middle value (or the average of the

middle two values if there is an even number of observations). The mode is the most

commonly occurring value or values (see Exercise 8).

The following data have been collected from veterans with rheumatoid arthritis (Tran,

Hooker, Cipher, & Reimold, 2009). The values in Table 27-1 were extracted from a larger

sample of veterans who had a history of biologic medication use (e.g., infliximab [Remicade],

etanercept [Enbrel]). Table 27-1 contains data collected from 10 veterans who had

stopped taking biologic medications, and the variable represents the number of years that

each veteran had taken the medication before stopping.

Because the number of study subjects represented below is 10, the correct statistical

notation to reflect that number is:

n = 10

Note that the n is lowercase, because we are referring to a sample of veterans. If the

data being presented represented the entire population of veterans, the correct notation

is the uppercase N. Because most nursing research is conducted using samples, not populations,

all formulas in the subsequent exercises will incorporate the sample notation, n.

Mode

The mode is the numerical value or score that occurs with the greatest frequency; it does

not necessarily indicate the center of the data set. The data in Table 27-1 contain two


modes: 1.5 and 3.0. Each of these numbers occurred twice in the data set. When two

modes exist, the data set is referred to as bimodal; a data set that contains more than

two modes would be multimodal.
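As a rough check on the modes, the sketch below uses Python's standard `statistics` module on the Table 27-1 values. The variable names are illustrative and not part of the workbook.

```python
from statistics import multimode

# Duration of biologic use (years), Table 27-1
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

# multimode returns every value tied for the highest frequency
modes = multimode(durations)
print(modes)  # two values are returned, so the data set is bimodal
```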

Median

The median (MD) is the score at the exact center of the ungrouped frequency distribution.

It is the 50th percentile. To obtain the MD, sort the values from lowest to highest. If the

number of values is an uneven number, exactly 50% of the values are above the MD and

50% are below it. If the number of values is an even number, the MD is the average of the

two middle values. Thus the MD may not be an actual value in the data set. For example,

the data in Table 27-1 consist of 10 observations, and therefore the MD is calculated as

the average of the two middle values.

$$MD = \frac{1.5 + 2.0}{2} = 1.75$$
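The same median can be reproduced in a few lines of Python. This is only a sketch using the standard library; the variable names are assumptions made for illustration.

```python
from statistics import median

# Duration of biologic use (years), Table 27-1
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

# With an even number of observations, median() averages the two middle values
ordered = sorted(durations)
middle_pair = (ordered[4], ordered[5])  # 5th and 6th of 10 sorted values
md = median(durations)
print(middle_pair, md)
```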

Mean

The most commonly reported measure of central tendency is the mean. The mean is the

sum of the scores divided by the number of scores being summed. Thus like the MD, the

mean may not be a member of the data set. The formula for calculating the mean is as

follows:

$$\bar{X} = \frac{\sum X}{n}$$

where

$\bar{X}$ = mean

Σ = sigma, the statistical symbol for summation

X = a single value in the sample

n = total number of values in the sample

The mean number of years that the veterans used a biologic medication is calculated

as follows:

0 1 0 3 1 3 1 5 1 5 2 0 2 2 3 0 3 0 4 0

10

1 9

. . . . . . . . . .

. years
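The mean formula can be mirrored directly in Python, with the sum of the values divided by n. The names below are illustrative only.

```python
# Duration of biologic use (years), Table 27-1
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

n = len(durations)        # number of values in the sample
total = sum(durations)    # sigma X, the sum of all values
mean = total / n          # X-bar
print(round(total, 1), n, round(mean, 2))
```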

TABLE 27-1 DURATION OF BIOLOGIC USE AMONG VETERANS WITH RHEUMATOID ARTHRITIS (n = 10)

| Duration of Biologic Use (years) |
| --- |
| 0.1 |
| 0.3 |
| 1.3 |
| 1.5 |
| 1.5 |
| 2.0 |
| 2.2 |
| 3.0 |
| 3.0 |
| 4.0 |


The mean is an appropriate measure of central tendency for approximately normally

distributed populations with variables measured at the interval or ratio level. It is also

appropriate for ordinal level data such as Likert scale values, where higher numbers represent

more of the construct being measured and lower numbers represent less of the

construct (such as pain levels, patient satisfaction, depression, and health status).

The mean is sensitive to extreme scores such as outliers. An outlier is a value in a

sample data set that is unusually low or unusually high in the context of the rest of the

sample data. An example of an outlier in the data presented in Table 27-1 might be a

value such as 11. The existing values range from 0.1 to 4.0, meaning that no veteran used

a biologic beyond 4 years. If an additional veteran were added to the sample and that

person used a biologic for 11 years, the mean would be much larger: 2.7 years. Simply

adding this outlier to the sample raised the mean by about 44% (from 1.89 to 2.72 years). The outlier would also

change the frequency distribution. Without the outlier, the frequency distribution is

approximately normal, as shown in Figure 27-1. Including the outlier changes the shape

of the distribution to appear positively skewed.
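To see how a single extreme value pulls the mean, the sketch below adds the hypothetical 11-year value to the Table 27-1 data and compares the mean and median before and after. The variable names are illustrative.

```python
from statistics import mean, median

# Duration of biologic use (years), Table 27-1
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

# Hypothetical extra veteran who used a biologic for 11 years
with_outlier = durations + [11.0]

mean_before, mean_after = mean(durations), mean(with_outlier)
median_before, median_after = median(durations), median(with_outlier)

print(round(mean_before, 2), round(mean_after, 2))  # the mean shifts markedly
print(median_before, median_after)                  # the median shifts far less
```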

Although the use of summary statistics has been the traditional approach to describing

data or describing the characteristics of the sample before inferential statistical analysis,

its ability to clarify the nature of data is limited. For example, using measures of central

tendency, particularly the mean, to describe the nature of the data obscures the impact

of extreme values or deviations in the data. Thus, significant features in the data may be

concealed or misrepresented. Often, anomalous, unexpected, or problematic data and

discrepant patterns are evident, but are not regarded as meaningful. Measures of dispersion,

such as the range, difference scores, variance, and standard deviation (SD), provide

important insight into the nature of the data.

MEASURES OF DISPERSION

Measures of dispersion, or variability, are measures of individual differences of the

members of the population and sample. They indicate how values in a sample are dispersed

around the mean. These measures provide information about the data that is not

available from measures of central tendency. They indicate how different the scores are—

the extent to which individual values deviate from one another. If the individual values

are similar, measures of variability are small and the sample is relatively homogeneous

in terms of those values. Heterogeneity (wide variation in scores) is important in some

statistical procedures, such as correlation. Heterogeneity is determined by measures of

variability. The measures most commonly used are range, difference scores, variance, and

SD (see Exercise 9).

FIGURE 27-1 ■ FREQUENCY DISTRIBUTION OF YEARS OF BIOLOGIC USE, WITHOUT OUTLIER AND WITH OUTLIER.

[Two histograms. x-axis: years of biologic use; y-axis: frequency.]


Range

The simplest measure of dispersion is the range. In published studies, range is presented

in two ways: (1) the range is the lowest and highest scores, or (2) the range is calculated

by subtracting the lowest score from the highest score. The range for the scores in Table

27-1 is 0.1 and 4.0, or it can be calculated as follows: 4.0 − 0.1 = 3.9. In this form, the

range is a difference score that uses only the two extreme scores for the comparison. The

range is generally reported but is not used in further analyses.
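Both ways of reporting the range can be sketched in Python from the lowest value (0.1) and highest value (4.0) in Table 27-1. The names are illustrative.

```python
# Duration of biologic use (years), Table 27-1
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

lowest, highest = min(durations), max(durations)

# Form 1: report the two extreme scores
range_as_pair = (lowest, highest)

# Form 2: report the difference between them (rounded to avoid float noise)
range_as_difference = round(highest - lowest, 1)

print(range_as_pair, range_as_difference)
```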

Difference Scores

Difference scores are obtained by subtracting the mean from each score. Sometimes a

difference score is referred to as a deviation score because it indicates the extent to which

a score deviates from the mean. Of course, most variables in nursing research are not

“scores,” yet the term difference score is used to represent a value's deviation from the

mean. The difference score is positive when the score is above the mean, and it is negative

when the score is below the mean (see Table 27-2). Difference scores are the basis for many

statistical analyses and can be found within many statistical equations. The formula for

difference scores is:

$$X - \bar{X}$$

TABLE 27-2 DIFFERENCE SCORES OF DURATION OF BIOLOGIC USE

| $X$ | $\bar{X}$ | $X - \bar{X}$ |
| --- | --- | --- |
| 0.1 | 1.9 | −1.8 |
| 0.3 | 1.9 | −1.6 |
| 1.3 | 1.9 | −0.6 |
| 1.5 | 1.9 | −0.4 |
| 1.5 | 1.9 | −0.4 |
| 2.0 | 1.9 | 0.1 |
| 2.2 | 1.9 | 0.3 |
| 3.0 | 1.9 | 1.1 |
| 3.0 | 1.9 | 1.1 |
| 4.0 | 1.9 | 2.1 |

Σ of absolute values: 9.5

The mean deviation is the average difference score, using the absolute values. The

formula for the mean deviation is:

$$\bar{X}_{\text{deviation}} = \frac{\sum \left| X - \bar{X} \right|}{n}$$

In this example, the mean deviation is 0.95. This value was calculated by taking the

sum of the absolute value of each difference score (1.8, 1.6, 0.6, 0.4, 0.4, 0.1, 0.3, 1.1, 1.1,

2.1) and dividing by 10. The result indicates that, on average, subjects' duration of biologic

use deviated from the mean by 0.95 years.
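The difference scores and the mean deviation can be reproduced with the short sketch below. One assumption to note: the code uses the unrounded mean (1.89) rather than the rounded 1.9 shown in Table 27-2, so individual difference scores differ by 0.01, although the mean deviation is the same.

```python
# Duration of biologic use (years), Table 27-1
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

n = len(durations)
mean = sum(durations) / n  # unrounded mean, 1.89

# Difference (deviation) score for each value: X minus the mean
differences = [x - mean for x in durations]

# Mean deviation: average of the absolute difference scores
mean_deviation = sum(abs(d) for d in differences) / n

print([round(d, 2) for d in differences])
print(round(mean_deviation, 2))
```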

Variance

Variance is another measure commonly used in statistical analysis. The equation for a

sample variance ($s^2$) is below.

$$s^2 = \frac{\sum \left( X - \bar{X} \right)^2}{n - 1}$$


Note that the lowercase $s^2$ is used to represent a sample variance. The lowercase

Greek sigma ($\sigma^2$) is used to represent a population variance, in which the denominator is

N instead of n − 1. Because most nursing research is conducted using samples, not populations,

formulas in the subsequent exercises that contain a variance or standard deviation

will incorporate the sample notation, using n − 1 as the denominator. Moreover, statistical

software packages compute the variance and standard deviation using the sample formulas,

not the population formulas.

The variance is always a positive value and has no upper limit. In general, the larger the

variance, the larger the dispersion of scores. The variance is most often computed to derive

the standard deviation because, unlike the variance, the standard deviation reflects important

properties about the frequency distribution of the variable it represents. Table 27-3

displays how we would compute a variance by hand, using the biologic duration data.

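$$s^2 = \frac{13.41}{9}$$

$$s^2 = 1.49$$

TABLE 27-3 VARIANCE COMPUTATION OF BIOLOGIC USE

| $X$ | $\bar{X}$ | $X - \bar{X}$ | $(X - \bar{X})^2$ |
| --- | --- | --- | --- |
| 0.1 | 1.9 | −1.8 | 3.24 |
| 0.3 | 1.9 | −1.6 | 2.56 |
| 1.3 | 1.9 | −0.6 | 0.36 |
| 1.5 | 1.9 | −0.4 | 0.16 |
| 1.5 | 1.9 | −0.4 | 0.16 |
| 2.0 | 1.9 | 0.1 | 0.01 |
| 2.2 | 1.9 | 0.3 | 0.09 |
| 3.0 | 1.9 | 1.1 | 1.21 |
| 3.0 | 1.9 | 1.1 | 1.21 |
| 4.0 | 1.9 | 2.1 | 4.41 |
| | | Σ | 13.41 |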
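The hand computation in Table 27-3 can be checked with a small Python sketch. It applies the sample formula (n − 1 in the denominator) and compares the result with `statistics.variance`, which uses the same sample formula. The code works from the unrounded mean, so its sum of squares (about 13.409) differs slightly from the table's 13.41.

```python
from statistics import variance

# Duration of biologic use (years), Table 27-1
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

n = len(durations)
mean = sum(durations) / n

# Sum of squared difference scores, then divide by n - 1 (sample variance)
sum_of_squares = sum((x - mean) ** 2 for x in durations)
s2 = sum_of_squares / (n - 1)

print(round(sum_of_squares, 3), round(s2, 2))
print(round(variance(durations), 2))  # library result for comparison
```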

Standard Deviation

Standard deviation is a measure of dispersion that is the square root of the variance.

The standard deviation is represented by the notation s or SD. The equation for obtaining

a standard deviation is

$$SD = \sqrt{\frac{\sum \left( X - \bar{X} \right)^2}{n - 1}}$$

Table 27-3 displays the computations for the variance. To compute the SD, simply take

the square root of the variance. We know that the variance of biologic duration is

$s^2$ = 1.49. Therefore, the SD of biologic duration is SD = 1.22. The SD is an important statistic,

both for understanding dispersion within a distribution and for interpreting the

relationship of a particular value to the distribution.
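As a quick illustration, the SD can be obtained in Python either by taking the square root of the sample variance or by calling `statistics.stdev`, which applies the sample (n − 1) formula. Variable names are illustrative.

```python
from math import sqrt
from statistics import stdev, variance

# Duration of biologic use (years), Table 27-1
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

sd_from_variance = sqrt(variance(durations))  # square root of s squared
sd_direct = stdev(durations)                  # sample standard deviation

print(round(sd_from_variance, 2), round(sd_direct, 4))
```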

SAMPLING ERROR

A standard error describes the extent of sampling error. For example, a standard error

of the mean is calculated to determine the magnitude of the variability associated with

the mean. A small standard error is an indication that the sample mean is close to


the population mean, while a large standard error yields less certainty that the sample

mean approximates the population mean. The formula for the standard error of the mean

($s_{\bar{X}}$) is:

$$s_{\bar{X}} = \frac{s}{\sqrt{n}}$$

Using the biologic medication duration data, we know that the standard deviation of

biologic duration is s = 1.22. Therefore, the standard error of the mean for biologic duration

is computed as follows:

$$s_{\bar{X}} = \frac{1.22}{\sqrt{10}}$$

$$s_{\bar{X}} = 0.39$$

The standard error of the mean for biologic duration is 0.39.
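A minimal sketch of the standard error computation follows, dividing the sample standard deviation by the square root of n. The names are assumptions for illustration.

```python
from math import sqrt
from statistics import stdev

# Duration of biologic use (years), Table 27-1
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

n = len(durations)
s = stdev(durations)            # sample standard deviation
standard_error = s / sqrt(n)    # standard error of the mean

print(round(standard_error, 2))
```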

Confidence Intervals

To determine how closely the sample mean approximates the population mean, the standard

error of the mean is used to build a confidence interval. For that matter, a confidence

interval can be created for many statistics, such as a mean, proportion, and odds ratio.

To build a confidence interval around a statistic, you must have the standard error value

and the t value to adjust the standard error. The degrees of freedom (df) to use to compute

a confidence interval is df = n − 1.

To compute the confidence interval for a mean, the lower and upper limits of that

interval are created by multiplying the $s_{\bar{X}}$ by the t statistic, where df = n − 1. For a 95%

confidence interval, the t value should be selected at α = 0.05. For a 99% confidence interval,

the t value should be selected at α = 0.01.

Using the biologic medication duration data, we know that the standard error of the

mean duration of biologic medication use is $s_{\bar{X}} = 0.39$. The mean duration of biologic

medication use is 1.89. Therefore, the 95% confidence interval for the mean duration of

biologic medication use is computed as follows:

$$\bar{X} \pm s_{\bar{X}} \, t$$

$$1.89 \pm (0.39)(2.26)$$

$$1.89 \pm 0.88$$

As referenced in Appendix A, the t value required for the 95% confidence interval with

df = 9 is 2.26. The computation above results in a lower limit of 1.01 and an upper limit

of 2.77. This means that our confidence interval of 1.01 to 2.77 estimates the population

mean duration of biologic use with 95% confidence (Kline, 2004). Technically and mathematically,

it means that if we drew an infinite number of samples of veterans and computed a 95%

confidence interval for the mean duration of biologic medication use from each sample, 95% of the intervals would contain the true

population mean, and 5% would not contain the population mean (Gliner, Morgan, &

Leech, 2009). If we were to compute a 99% confidence interval, we would require the t

value that is referenced at α = 0.01. Therefore, the 99% confidence interval for the mean

duration of biologic medication use is computed as follows:

$$1.89 \pm (0.39)(3.25)$$

$$1.89 \pm 1.27$$


As referenced in Appendix A, the t value required for the 99% confidence interval with

df = 9 is 3.25. The computation above results in a lower limit of 0.62 and an upper limit

of 3.16. This means that our confidence interval of 0.62 to 3.16 estimates the population

mean duration of biologic use with 99% confidence.
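The two intervals can be approximated with the sketch below. It makes two assumptions: the t values for df = 9 (2.26 and 3.25) are hardcoded from the text rather than looked up, and the standard error is left unrounded. Because the hand calculation rounds the standard error to 0.39, the limits here differ from it by a few hundredths. The function name `confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Duration of biologic use (years), Table 27-1
durations = [0.1, 0.3, 1.3, 1.5, 1.5, 2.0, 2.2, 3.0, 3.0, 4.0]

# t values for df = n - 1 = 9, as quoted in the text (assumed, not computed here)
T_95 = 2.26
T_99 = 3.25


def confidence_interval(values, t_value):
    """Return (lower, upper) limits: mean plus or minus t times the standard error."""
    center = mean(values)
    standard_error = stdev(values) / sqrt(len(values))
    margin = t_value * standard_error
    return center - margin, center + margin


ci95 = confidence_interval(durations, T_95)
ci99 = confidence_interval(durations, T_99)
print([round(limit, 2) for limit in ci95])
print([round(limit, 2) for limit in ci99])
```

The wider 99% interval reflects the larger t value: more confidence requires a larger margin around the same mean.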

Degrees of Freedom

The concept of degrees of freedom (df) was used in reference to computing a confidence

interval. For any statistical computation, degrees of freedom are the number of independent

pieces of information that are free to vary in order to estimate another piece of

information (Zar, 2010). In the case of the confidence interval, the degrees of freedom are

n − 1. This means that there are n − 1 independent observations in the sample that are

free to vary (to be any value) to estimate the lower and upper limits of the confidence

interval.

SPSS COMPUTATIONS

A retrospective descriptive study examined the duration of biologic use from veterans

with rheumatoid arthritis (Tran et al., 2009). The values in Table 27-4 were extracted from

a larger sample of veterans who had a history of biologic medication use (e.g., infliximab

[Remicade], etanercept [Enbrel]). Table 27-4 contains simulated demographic data collected

from 10 veterans who had stopped taking biologic medications. Age at study enrollment,

duration of biologic use, race/ethnicity, gender (F = female), tobacco use (F = former

use, C = current use, N = never used), primary diagnosis (3 = irritable bowel syndrome, 4

= psoriatic arthritis, 5 = rheumatoid arthritis, 6 = reactive arthritis), and type of biologic

medication used were among the study variables examined.

TABLE 27-4 DEMOGRAPHIC VARIABLES OF VETERANS WITH RHEUMATOID ARTHRITIS

| Patient ID | Duration (yrs) | Age | Race/Ethnicity | Gender | Tobacco | Diagnosis | Biologic |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.1 | 42 | Caucasian | F | F | 5 | Infliximab |
| 2 | 0.3 | 41 | Black, not of Hispanic Origin | F | F | 5 | Etanercept |
| 3 | 1.3 | 56 | Caucasian | F | N | 5 | Infliximab |
| 4 | 1.5 | 78 | Caucasian | F | F | 3 | Infliximab |
| 5 | 1.5 | 86 | Black, not of Hispanic Origin | F | F | 4 | Etanercept |
| 6 | 2.0 | 49 | Caucasian | F | F | 6 | Etanercept |
| 7 | 2.2 | 82 | Caucasian | F | F | 5 | Infliximab |
| 8 | 3.0 | 35 | Caucasian | F | N | 3 | Infliximab |
| 9 | 3.0 | 59 | Black, not of Hispanic Origin | F | C | 3 | Infliximab |
| 10 | 4.0 | 37 | Caucasian | F | F | 5 | Etanercept |


This is how our data set looks in SPSS.

Step 1: For a nominal variable, the appropriate descriptive statistics are frequencies and

percentages. From the “Analyze” menu, choose “Descriptive Statistics” and “Frequencies.”

Move “Race/Ethnicity” and “Gender” over to the right. Click “OK.”


Step 2: For a continuous variable, the appropriate descriptive statistics are means and

standard deviations. From the “Analyze” menu, choose “Descriptive Statistics” and

“Explore.” Move “Duration” over to the right. Click “OK.”

INTERPRETATION OF SPSS OUTPUT

The following tables are generated from SPSS. The first set of tables (from the first set of

SPSS commands in Step 1) contains the frequencies of race/ethnicity and gender. Most

(70%) were Caucasian, and 100% were female.

Frequencies

Frequency Table

RaceEthnicity

| | Frequency | Percent | Valid Percent | Cumulative Percent |
| --- | --- | --- | --- | --- |
| Black, not of Hispanic Origin | 3 | 30.0 | 30.0 | 30.0 |
| Caucasian | 7 | 70.0 | 70.0 | 100.0 |
| Total | 10 | 100.0 | 100.0 | |

Gender

| | Frequency | Percent | Valid Percent | Cumulative Percent |
| --- | --- | --- | --- | --- |
| F | 10 | 100.0 | 100.0 | 100.0 |
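For readers without SPSS, the frequency and percentage columns can be approximated with Python's `collections.Counter`. This is only a sketch: the race/ethnicity values are typed in from Table 27-4, and the variable names are illustrative.

```python
from collections import Counter

# Race/ethnicity for patients 1 through 10, Table 27-4
BLACK = "Black, not of Hispanic Origin"
WHITE = "Caucasian"
race = [WHITE, BLACK, WHITE, WHITE, BLACK, WHITE, WHITE, WHITE, BLACK, WHITE]

counts = Counter(race)  # frequency of each category
percentages = {category: 100 * count / len(race)
               for category, count in counts.items()}

for category in sorted(counts):
    print(category, counts[category], percentages[category])
```

The same pattern applies to any nominal variable in the table, such as tobacco use, diagnosis code, or biologic type.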


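Descriptives

| Duration of Biologic Use | | Statistic | Std. Error |
| --- | --- | --- | --- |
| Mean | | 1.890 | .3860 |
| 95% Confidence Interval for Mean | Lower Bound | 1.017 | |
| | Upper Bound | 2.763 | |
| 5% Trimmed Mean | | 1.872 | |
| Median | | 1.750 | |
| Variance | | 1.490 | |
| Std. Deviation | | 1.2206 | |
| Minimum | | .1 | |
| Maximum | | 4.0 | |
| Range | | 3.9 | |
| Interquartile Range | | 2.0 | |
| Skewness | | .159 | .687 |
| Kurtosis | | -.437 | 1.334 |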

The second set of output (from the second set of SPSS commands in Step 2) contains the

descriptive statistics for “Duration,” including the mean, s (standard deviation), SE, 95%

confidence interval for the mean, median, variance, minimum value, maximum value,

range, and skewness and kurtosis statistics. As shown in the output, mean number of

years for duration is 1.89, and the SD is 1.22. The 95% CI is 1.02–2.76.

Explore


STUDY QUESTIONS

1. Define mean.

2. What does this symbol, $s^2$, represent?

3. Define outlier.

4. Are there any outliers among the values representing duration of biologic use?

5. How would you interpret the 95% confidence interval for the mean of duration of biologic use?

6. What percentage of patients were Black, not of Hispanic origin?

7. Can you compute the variance for duration of biologic use by using the information presented

in the SPSS output above?


8. Plot the frequency distribution of duration of biologic use.

9. Where is the median in relation to the mean in the frequency distribution of duration of biologic

use?

10. When would a median be more informative than a mean in describing a variable?

ANSWERS TO STUDY QUESTIONS

1. The mean is defined as the arithmetic average of a set of numbers.

2. $s^2$ represents the sample variance of a given variable.

3. An outlier is a value in a sample data set that is unusually low or unusually high in the context of the rest of the sample data.

4. There are no outliers among the values representing duration of biologic use.

5. The 95% CI is 1.02–2.76, meaning that our confidence interval of 1.02–2.76 estimates the population mean duration of biologic use with 95% confidence.

6. 30% of patients were Black, not of Hispanic origin.

7. Yes, the variance for duration of biologic use can be computed by squaring the SD presented in the SPSS table. The SD is listed as 1.22, and, therefore, the variance is 1.22² or 1.49.

8. The frequency distribution approximates the following plot:

[Histogram. x-axis: duration of biologic use (0 to 5.0 years); y-axis: frequency (0 to 3). Mean = 1.89, Std. Dev. = 1.221, N = 10.]

9. The median is 1.75 and the mean is 1.89. Therefore, the median is lower in relation to the mean in the frequency distribution of duration of biologic use.

10. A median can be more informative than a mean in describing a variable when the variable's frequency distribution is positively or negatively skewed. While the mean is sensitive to outliers, the median is relatively unaffected.

Questions to Be Graded EXERCISE 27


1. What is the mean age of the sample data?

2. What percentage of patients never used tobacco?

3. What is the standard deviation for age?

4. Are there outliers among the values of age? Provide a rationale for your answer.

5. What is the range of age values?


6. What percentage of patients were taking infliximab?

7. What percentage of patients had rheumatoid arthritis as their primary diagnosis?

8. What percentage of patients had irritable bowel syndrome as their primary diagnosis?

9. What is the 95% CI for age?

10. What percentage of patients had psoriatic arthritis as their primary diagnosis?
