4 Describing Data and Results – in development
You have already identified (Step 1) your aims, factors and variables for measuring the system that you are investigating.
ACTION: You now need to identify the statistical methods that you can use to describe and summarize your data and any results.
If you have multivariate data (i.e. more than one response variable) then you should also look at Step 8.
Set of replicate measurements:
A set of replicate data values is a number of measurements of the 'same thing', e.g. several measurements of pH from the same place in a river, or the salaries of randomly selected workers in the same factory.
The 'average' or 'middle' value is best given by:
> Mean for normally distributed data, or for data that is not heavily skewed in one direction
> Median for data that may not be normally distributed, e.g. salaries, whose distribution is typically skewed with a long tail towards high values.
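The difference between the two measures can be seen in a short Python sketch. The salary figures below are made up for illustration (they are not real data), and only the standard `statistics` module is used.

```python
import statistics

# Hypothetical annual salaries (in thousands). One very high value gives
# the long tail towards high values that is typical of salary data.
salaries = [21, 23, 24, 25, 26, 27, 29, 31, 35, 120]

mean_salary = statistics.mean(salaries)      # pulled upwards by the 120
median_salary = statistics.median(salaries)  # middle of the sorted values

print(f"mean   = {mean_salary:.1f}")
print(f"median = {median_salary:.1f}")
```

Here the mean (36.1) lies above nine of the ten values because of the single large salary, while the median (26.5) stays with the bulk of the data.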
The 'spread' of data values is best given by:
> Standard deviation for normally (or near normally) distributed data
> Interquartile range for any type of data
> Maximum and minimum values and the range are only useful in specific cases where limits are important
> Outliers are data values that are so far from the 'middle' value that they might be errors or anomalies.
Notes:
The boxplot is a graphical picture of the interquartile range, median, extreme values and outliers, and is an excellent way of illustrating any data set.
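The quantities that a boxplot displays can be calculated directly. The sketch below uses made-up pH readings and assumes the common convention that values more than 1.5 interquartile ranges beyond the quartiles are flagged as possible outliers. Software packages differ slightly in how they define quartiles, so values from Excel or Minitab may not match exactly.

```python
import statistics

# Hypothetical replicate pH readings; 9.6 is included as a suspect value.
ph = [7.1, 7.3, 7.2, 7.4, 7.0, 7.2, 7.3, 7.1, 7.2, 9.6]

sd = statistics.stdev(ph)                    # sample standard deviation
q1, q2, q3 = statistics.quantiles(ph, n=4)   # quartiles; q2 is the median
iqr = q3 - q1                                # interquartile range
data_range = max(ph) - min(ph)               # maximum minus minimum

# Assumed convention: flag values more than 1.5 x IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ph if x < lower_fence or x > upper_fence]

print(f"SD = {sd:.3f}, IQR = {iqr:.3f}, range = {data_range:.1f}")
print("possible outliers:", outliers)
```

In this example the single suspect reading inflates the standard deviation and the range, whereas the interquartile range is hardly affected.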
The 'best-estimate' value of the variable being sampled, together with its uncertainty, is best given by:
> Mean plus either standard error of the mean (standard uncertainty) or confidence interval for normal data.
> Median plus confidence interval of the median for non-normal (particularly skewed) data.
Notes:
The confidence interval of the median can be calculated using Minitab.
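As an illustration of both options, the Python sketch below calculates a mean with its standard error and confidence interval, and a distribution-free confidence interval for the median based on order statistics. The function names and data are hypothetical, the t-value must be supplied from tables (2.262 is the two-sided 95% value for 9 degrees of freedom), and the median method shown is one standard approach, not necessarily the one used by any particular package.

```python
import math
import statistics


def mean_with_uncertainty(values, t_crit):
    """Return (mean, standard error of the mean, CI half-width).

    t_crit is the two-sided Student's t value for the chosen confidence
    level with len(values) - 1 degrees of freedom (from tables/software).
    """
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem, t_crit * sem


def median_confidence_interval(values, confidence=0.95):
    """Return (median, lower, upper, achieved coverage).

    The limits are two of the sorted data values, chosen with the
    Binomial(n, 0.5) distribution so that the coverage is at least
    the requested confidence level.
    """
    data = sorted(values)
    n = len(data)
    tail = (1 - confidence) / 2
    # cdf[k] = P(X <= k) for X ~ Binomial(n, 0.5)
    cdf, total = [], 0.0
    for k in range(n + 1):
        total += math.comb(n, k) * 0.5 ** n
        cdf.append(total)
    # r = largest rank (counting from 1) with P(X <= r - 1) <= tail
    r = 0
    while cdf[r] <= tail:
        r += 1
    if r == 0:
        raise ValueError("sample too small for this confidence level")
    coverage = 1 - 2 * cdf[r - 1]
    return statistics.median(data), data[r - 1], data[n - r], coverage


# Hypothetical data, for illustration only.
ph = [7.1, 7.3, 7.2, 7.4, 7.0, 7.2, 7.3, 7.1, 7.2, 7.2]
mean, sem, half_width = mean_with_uncertainty(ph, t_crit=2.262)
print(f"mean = {mean:.2f}, SEM = {sem:.3f}, 95% CI = +/- {half_width:.3f}")

salaries = [21, 23, 24, 25, 26, 27, 29, 31, 35, 120]
med, low, high, cover = median_confidence_interval(salaries)
print(f"median = {med}, CI = {low} to {high} (coverage {cover:.1%})")
```

For the ten salaries the interval runs from the 2nd to the 9th sorted value (23 to 35). Because only whole ranks can be used, the achieved coverage (about 97.9%) is somewhat higher than the 95% requested.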
Distribution of replicated data values is best described by:
> Frequency column graph of data values gives a direct visual representation
> Skewness measures the extent to which the data is not symmetrical about the mean value
> Kurtosis measures the extent to which the distribution of the data is peaked or flattened compared with a normal distribution.
> Results of a normality test - see Testing
Notes:
Values of skewness, and particularly kurtosis, are only useful for large data sets, and would not be used routinely.
Performing a normality test can be useful to check whether there may be a significant deviation in the data from a normal distribution.
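For reference, skewness and kurtosis can be calculated from the moments of the data, as in the Python sketch below. It uses the simple moment definitions; packages such as Excel and Minitab apply small-sample corrections, so their values will differ slightly. The function name and data are hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both 0 for a normal curve)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # variance (divisor n)
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Symmetrical data: skewness is zero.
sym_skew, sym_kurt = skewness_and_excess_kurtosis([1, 2, 3, 4, 5])

# Hypothetical salaries with a long tail towards high values:
# skewness is positive.
salaries = [21, 23, 24, 25, 26, 27, 29, 31, 35, 120]
sal_skew, sal_kurt = skewness_and_excess_kurtosis(salaries)

print(f"symmetrical: skewness = {sym_skew:.2f}, kurtosis = {sym_kurt:.2f}")
print(f"salaries:    skewness = {sal_skew:.2f}, kurtosis = {sal_kurt:.2f}")
```

With only five or ten values these figures are very unstable, which is why they are only worth quoting for large data sets.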
Variation of one variable with respect to another:
Variation of a scale variable with respect to another scale variable
> X-y scatter graph with the response (dependent) variable on the y-axis
Notes:
Do NOT use a line graph in Excel - its x-axis treats the values as categories and not as scale values
Do NOT join the data values with a line. A 'best-fit' line can be used - see modelling.
Error bars can (and usually should) be shown, normally as one (sometimes two) standard deviations, but what they represent must be specified in the key.
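A possible way of preparing such a graph in Python is sketched below. The data, variable names and axis labels are hypothetical. The summary step uses only the standard library; the plotting function assumes that the matplotlib package is installed and is only run when called.

```python
import statistics
from collections import defaultdict


def summarise_replicates(pairs):
    """Group (x, y) readings by x; return sorted (x, mean y, SD of y) rows."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return [(x, statistics.mean(ys), statistics.stdev(ys))
            for x, ys in sorted(groups.items())]


def plot_summary(rows):
    """Draw an x-y scatter graph with error bars (requires matplotlib)."""
    import matplotlib.pyplot as plt

    xs, means, sds = zip(*rows)
    # fmt="o" draws markers only, so the points are not joined by a line.
    plt.errorbar(xs, means, yerr=sds, fmt="o", capsize=3,
                 label="mean ± 1 standard deviation")
    plt.xlabel("temperature")   # scale variable on the x-axis
    plt.ylabel("response")      # response (dependent) variable on the y-axis
    plt.legend()                # the key states what the error bars show
    plt.show()


# Hypothetical data: three replicate readings at each temperature.
readings = [(10, 2.1), (10, 2.3), (10, 2.2),
            (20, 3.9), (20, 4.2), (20, 4.1),
            (30, 6.0), (30, 6.3), (30, 5.8)]
summary = summarise_replicates(readings)
for x, mean_y, sd_y in summary:
    print(f"x = {x}: mean = {mean_y:.2f}, SD = {sd_y:.2f}")
```

Calling `plot_summary(summary)` then draws the graph, with the x-values placed at their true positions on a scale axis.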
Variation of a scale variable with respect to an ordinal variable
> Line graph
> X-y graph
Notes:
Variation of a scale variable with respect to a nominal variable
> A variety of options is available, e.g. column graph, pie chart
Notes:
Frequency data
Cross-tabulation
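As a minimal illustration, a cross-tabulation simply counts how many records fall into each combination of two categorical variables. The Python sketch below uses made-up survey records; the category names are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical records: (sampling site, whether a species was found).
records = (
    [("upstream", "present")] * 3 + [("upstream", "absent")] * 1 +
    [("downstream", "present")] * 1 + [("downstream", "absent")] * 3
)

counts = Counter(records)                    # frequency of each combination
sites = sorted({site for site, _ in records})
outcomes = sorted({outcome for _, outcome in records})

# Nested dictionary: table[site][outcome] = frequency
table = {s: {o: counts[(s, o)] for o in outcomes} for s in sites}

# Print the table with one row per site and one column per outcome.
print(f"{'':12}" + "".join(f"{o:>10}" for o in outcomes))
for s in sites:
    print(f"{s:12}" + "".join(f"{table[s][o]:>10}" for o in outcomes))
```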
Multifactorial data
Factor plots
Interaction plots