Exploratory Data Analysis
Chapter4/Chapter Outlines.pdf
IBM SPSS for Introductory Statistics: Use and Interpretation, 5th Ed. (Morgan, Leech, Gloeckner & Barrett) Instructor’s Manual by Gene W. Gloeckner and Don Quick
Chapter 4 – Understanding Your Data and Checking Assumptions
Chapter Outline
I. Exploratory Data Analysis (EDA)
   A. What is EDA?
      1. The first step to complete after entering data and before running any inferential statistics.
      2. Computing various descriptive statistics and graphs in order to examine your data.
         a. Look for data errors, outliers, non-normal distributions, etc.
         b. Determine if the data meet the assumptions of the statistics you plan to use.
         c. Gather basic demographic information about the subjects.
         d. Examine relationships between the variables to determine how to conduct the hypothesis testing.
   B. How to do EDA
      1. Generate plots of the data.
      2. Generate numbers from the data.
   C. Check for Errors
      1. Examine raw data before entering.
      2. Compare some raw data against entered data.
      3. Compare maximum and minimum values against the allowable ranges.
      4. Examine the means and standard deviations to see if they seem reasonable.
      5. Look to see if there is an unreasonable amount of missing data.
      6. Look for outliers.
   D. Statistical Assumptions: explain when it is and isn’t reasonable to perform a specific statistical test.
      1. Parametric tests
         a. Usually have more assumptions than nonparametric tests.
         b. Generally designed for use with data that are approximately normally distributed.
         c. Some parametric tests are more robust in dealing with violations of assumptions than others.
      2. Nonparametric tests
         a. Have fewer assumptions.
         b. Can often be used when assumptions for parametric tests are violated.
   E. Parametric Tests
a.
Chapter4/Extra SPSS Problems.pdf
IBM SPSS for Introductory Statistics: Use and Interpretation, 5th Ed. (Morgan, Leech, Gloeckner & Barrett) Instructor’s Manual by Gene W. Gloeckner and Don Quick
Chapter 4 – Understanding Your Data and Checking Assumptions
Using the College Student data file, do the following problems. Print your outputs and circle the key parts of the output that you discuss.
4.1 For the variables with five or more ordered levels, compute the skewness. Describe the results. Which variables in the data set are approximately normally distributed/scale? Which ones are ordered but not normal?
• Select Analyze => Descriptive Statistics => Descriptives.
• Move student height, same sex parent’s height, amount of tv watched per week, hours of study per week, student’s current gpa, positive evaluation-institution, positive evaluation-major, positive evaluation-facilities, positive evaluation-social life, and hours per week spent working into the Variables box.
• Options => Check Skewness (in addition to Mean, Std. Deviation, Minimum, and Maximum) => Continue => OK.
The Valid N (listwise) for the variables selected is 48. The Means all seem reasonable and within the expected range. The Minimum and Maximum values are all within the expected range, based on the codebook. The N for each variable makes sense, and only two variables have missing values (positive evaluation-major and hours per week spent working). Each of these is missing one case, which leaves 48 participants with complete data on all ten variables.
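The range check described above can be sketched in Python. The observed minimum and maximum values are copied from Output 4.1; the allowable ranges are hypothetical stand-ins for the codebook, which is not reproduced in this manual, and the function name is an assumption made for illustration.

```python
# Observed (minimum, maximum) values as reported in Output 4.1.
observed = {
    "student's current gpa": (2.4, 4.0),
    "positive evaluation, institution": (2, 5),
    "positive evaluation, major": (1, 5),
    "positive evaluation, facilities": (1, 5),
    "positive eval, social life": (1, 5),
}

# HYPOTHETICAL allowable ranges; the real ones must come from the codebook.
allowable = {
    "student's current gpa": (0.0, 4.0),
    "positive evaluation, institution": (1, 5),
    "positive evaluation, major": (1, 5),
    "positive evaluation, facilities": (1, 5),
    "positive eval, social life": (1, 5),
}


def out_of_range(observed, allowable):
    """Return the variables whose observed min or max is outside the allowable range."""
    flagged = []
    for name, (obs_min, obs_max) in observed.items():
        allowed_min, allowed_max = allowable[name]
        if obs_min < allowed_min or obs_max > allowed_max:
            flagged.append(name)
    return flagged


flagged = out_of_range(observed, allowable)
print(flagged)  # an empty list means no value fell outside the assumed ranges
```

Any variable that appears in the flagged list would be a candidate data-entry error to check against the raw data.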
The Skewness Statistic is used to determine which of these variables are approximately normally distributed. The guideline is that if the Skewness Statistic is between -1 and 1, the variable is at least approximately normal. In this case, all the variables with five or more ordered levels fall into that range and would be considered approximately normally distributed. For this dataset, the ordinal variables with five or more ordered levels (positive evaluation-institution, positive evaluation-major, positive evaluation-facilities, positive evaluation-social life) are all approximately normally distributed, so we can treat them like scale variables and use inferential statistics that assume normality with them. None of the variables examined for this problem was ordered but not normal.
4.3 Which variables are nominal? Run frequencies for the nominal variables and other variables with fewer than five levels. Comment on the results.
• Select Analyze => Descriptive Statistics => Frequencies.
• Move gender of student, marital status, age group, does subject have children, television shows-sitcoms, television shows-movies, television shows-sports, television shows-news shows
The table titled Statistics provides the number of participants for whom we have Valid data and the number with Missing data. No other statistics were requested because almost none of them are appropriate for nominal and dichotomous data. Age group has three ordered levels, so it is ordinal and the median would be appropriate.
The other tables are labeled Frequency Table and there is one for each of the variables selected. The left-hand column shows the Valid categories (or levels or values), Missing values, and Total number of participants. The Frequency column gives the number of participants who had each value. The Percent column is the percent who had each value, including missing values. For example, in the marital status table, 40.0% of ALL participants were single, 36.0% were married, 22.0% were divorced, and 2.0% were missing. The Valid Percent shows the percent of those with nonmissing data at each value; e.g. 40.8% of the 49 students with valid data were single. Finally, Cumulative Percent is the percent of the subjects in a category plus the categories listed above it.
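As a minimal sketch of how the three percentage columns relate, the following Python snippet recomputes them from the marital status frequencies in Output 4.3. The variable names are chosen for illustration only; SPSS produces these columns itself.

```python
# Frequencies from the marital status table in Output 4.3.
counts = {"single": 20, "married": 18, "divorced": 11}
n_missing = 1

n_valid = sum(counts.values())  # participants with nonmissing data
n_total = n_valid + n_missing   # all participants

rows = []
cumulative = 0.0
for category, freq in counts.items():
    percent = 100 * freq / n_total        # Percent: based on ALL participants
    valid_percent = 100 * freq / n_valid  # Valid Percent: based on nonmissing only
    cumulative += valid_percent           # Cumulative Percent: running total of Valid Percent
    rows.append((category, round(percent, 1), round(valid_percent, 1), round(cumulative, 1)))

for row in rows:
    print(row)
```

The only difference between Percent and Valid Percent is the denominator (50 versus 49 here), which is why the two columns agree whenever a variable has no missing data.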
Fig. E.8
Fig. E.9
Ch. 4 Output 4.1
DESCRIPTIVES VARIABLES=height pheight hrstv hrsstudy currgpa evalinst evalprog evalphys evalsocl hrswork
  /STATISTICS=MEAN STDDEV MIN MAX SKEWNESS.
Descriptives
Fig. E.10
Descriptive Statistics
                                                                                    Skewness
Variable                             N  Minimum  Maximum      Mean  Std. Deviation  Statistic  Std. Error
student height in inches            50    60.00    75.00   67.3000         3.93959       .163        .337
same sex parent’s height            50    58.00    76.00   66.7800         5.10418       .333        .337
amount of tv watched per week       50        4       25     11.98           6.096       .645        .337
hours of study per week             50        2       38     15.62           8.310       .950        .337
student’s current gpa               50      2.4      4.0     3.172           .3907       .147        .337
positive evaluation, institution    50        2        5      3.38            .945       .059        .337
positive evaluation, major          49        1        5      3.27            .953      -.115        .340
positive evaluation, facilities     50        1        5      2.76           1.061      -.136        .337
positive eval, social life          50        1        5      3.10           1.182       .031        .337
hours per week spent working        49        0       50     26.12          14.857      -.516        .340
Valid N (listwise)                  48
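The skewness guideline can be applied to the values in Output 4.1 with a short Python sketch. The skewness statistics and Ns are copied from the output. The standard error formula is the commonly used large-sample expression; it is not stated in this manual, so treat it as an assumption that happens to reproduce the Std. Error column. The function names are illustrative.

```python
import math

# (skewness statistic, N) for each variable, as reported in Output 4.1.
skewness = {
    "student height in inches": (0.163, 50),
    "same sex parent's height": (0.333, 50),
    "amount of tv watched per week": (0.645, 50),
    "hours of study per week": (0.950, 50),
    "student's current gpa": (0.147, 50),
    "positive evaluation, institution": (0.059, 50),
    "positive evaluation, major": (-0.115, 49),
    "positive evaluation, facilities": (-0.136, 50),
    "positive eval, social life": (0.031, 50),
    "hours per week spent working": (-0.516, 49),
}


def approximately_normal(skew):
    """Guideline from the text: skewness between -1 and 1."""
    return -1 < skew < 1


def skewness_std_error(n):
    """Assumed formula for the standard error of skewness with sample size n."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


for name, (skew, n) in skewness.items():
    print(name, approximately_normal(skew), round(skewness_std_error(n), 3))
```

Note that the standard error depends only on N, which is why it is identical for every variable with the same number of valid cases.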
Ch. 4 Output 4.3
FREQUENCIES VARIABLES=gender marital age children tvsitcom tvmovies tvsports tvnews
  /ORDER=ANALYSIS.
Frequencies
Frequency Table
gender of student
                       Frequency  Percent  Valid Percent  Cumulative Percent
Valid    males                26     52.0           52.0                52.0
         females              24     48.0           48.0               100.0
         Total                50    100.0          100.0

marital status
                       Frequency  Percent  Valid Percent  Cumulative Percent
Valid    single               20     40.0           40.8                40.8
         married              18     36.0           36.7                77.6
         divorced             11     22.0           22.4               100.0
         Total                49     98.0          100.0
Missing  System                1      2.0
Total                         50    100.0
age group
                       Frequency  Percent  Valid Percent  Cumulative Percent
Valid    less than 22         17     34.0           34.0                34.0
         22-29                18     36.0           36.0                70.0
         30 or more           15     30.0           30.0               100.0
         Total                50    100.0          100.0

does subject have children
                       Frequency  Percent  Valid Percent  Cumulative Percent
Valid    no                   24     48.0           48.0                48.0
         yes                  26     52.0           52.0               100.0
         Total                50    100.0          100.0

television shows-sitcoms
                       Frequency  Percent  Valid Percent  Cumulative Percent
Valid    no                   18     36.0           36.0                36.0
         yes                  32     64.0           64.0               100.0
         Total                50    100.0          100.0

television shows-movies
                       Frequency  Percent  Valid Percent  Cumulative Percent
Valid    no                   32     64.0           64.0                64.0
         yes                  18     36.0           36.0               100.0
         Total                50    100.0          100.0
television shows-sports
                       Frequency  Percent  Valid Percent  Cumulative Percent
Valid    no                   24     48.0           48.0                48.0
         yes                  26     52.0           52.0               100.0
         Total                50    100.0          100.0

television shows-news shows
                       Frequency  Percent  Valid Percent  Cumulative Percent
Valid    no                   27     54.0           54.0                54.0
         yes                  23     46.0           46.0               100.0
         Total                50    100.0          100.0