Rozemarijn Witkam, Health and Wellbeing, University of Manchester, 2019 Cohort
The research I am involved in aims to understand how social factors and obesity interact to influence the development and outcomes of arthritis in the UK. I am using data from the English Longitudinal Study of Ageing, Understanding Society and the Norfolk Arthritis Register, three ongoing longitudinal observational datasets.
I am currently writing the analysis plan for my second objective: the relationship between different definitions of obesity (e.g. defined by either BMI or waist circumference) and incident arthritis in the UK. I am making good progress, and thanks to the course ‘Introduction to Dealing with Missing Data’, led by Chibueze Ogbonnaya and Eirini Koutoumanou, which I recently attended, I will soon be able to complete my analysis plan.
This two-day online course, organized by the Centre for Applied Statistics Courses (CASC) at University College London (UCL) on 7 and 8 December 2020, provided me with more background information for the section of my plan on how to handle missing data.
The first day of the course covered the various reasons for, and implications of, missing data, and introduced simple methods for dealing with it. As the course leaders explained, data can be missing for several reasons, including the sensitivity of a variable (e.g. income), dropout (e.g. participants who withdraw during a study) and loss to follow-up (e.g. participants who move out of the area without leaving an address). Missing data are then categorized by the mechanism behind the missingness. The different types of missing data are:
- Missing completely at random (MCAR): the probability that an observation is missing is unrelated both to the unobserved value itself and to the values of any other variables in the dataset;
- Missing at random (MAR): the probability that an observation is missing is unrelated to the unobserved value itself, after controlling for other variables in the dataset;
- Missing not at random (MNAR): even when accounting for all observed variables, there are still systematic differences between the missing and observed values.
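To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than material from the course or from my datasets: the variables `age` and `income`, the helper names `simulate` and `observed_mean`, and the missingness probabilities are all hypothetical. It compares the mean income among the observed values with the true mean under each mechanism.

```python
import random
import statistics


def simulate(n=5000, seed=0):
    """Hypothetical cohort: income rises with age, plus random noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 80)
        income = 20000 + 500 * age + rng.gauss(0, 5000)
        rows.append((age, income))
    return rows


def observed_mean(rows, p_missing, seed=1):
    """Mean income among participants whose income was NOT set to missing.

    p_missing(age, income) returns the probability that income is missing.
    """
    rng = random.Random(seed)
    kept = [inc for age, inc in rows if rng.random() >= p_missing(age, inc)]
    return statistics.mean(kept)


rows = simulate()
true_mean = statistics.mean(inc for _, inc in rows)

# MCAR: everyone has the same 30% chance of a missing income value.
mcar = observed_mean(rows, lambda age, inc: 0.3)
# MAR: missingness depends only on age, which is fully observed.
mar = observed_mean(rows, lambda age, inc: 0.6 if age > 50 else 0.1)
# MNAR: missingness depends on the (unobserved) income value itself.
mnar = observed_mean(rows, lambda age, inc: 0.6 if inc > 45000 else 0.1)

print(f"true {true_mean:.0f}  MCAR {mcar:.0f}  MAR {mar:.0f}  MNAR {mnar:.0f}")
```

In this sketch the observed mean stays close to the truth under MCAR, while the unadjusted observed mean is too low under both MAR and MNAR. The difference between the latter two is that under MAR the bias can in principle be corrected by using the observed variable (here, age), whereas under MNAR it cannot be corrected from the observed data alone.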
The next step is dealing with missing data. Simple methods include listwise deletion (excluding every participant who has a missing value on any variable in the analysis) and pairwise deletion (excluding a participant only from those calculations that involve the variable they are missing, so that each calculation uses all available cases). If data are missing completely at random, these methods will not lead to bias; however, excluding substantial amounts of data will, of course, lead to a loss of statistical power.
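The difference between the two deletion methods can be sketched on a tiny, made-up dataset. The records, variable names and the helper `available_pairs` below are hypothetical and only serve as an illustration.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"bmi": 31.2, "waist": 104.0, "income": None},
    {"bmi": 24.8, "waist": None, "income": 28000},
    {"bmi": 27.5, "waist": 92.0, "income": 35000},
    {"bmi": None, "waist": 88.0, "income": 41000},
]

# Listwise deletion: keep only participants with no missing value at all.
listwise = [r for r in records if all(v is not None for v in r.values())]


def available_pairs(rows, x, y):
    """Pairwise deletion: for one pair of variables, keep every participant
    who has both values, regardless of what else is missing."""
    return [(r[x], r[y]) for r in rows if r[x] is not None and r[y] is not None]


bmi_waist = available_pairs(records, "bmi", "waist")
waist_income = available_pairs(records, "waist", "income")

print(len(listwise), len(bmi_waist), len(waist_income))
```

Here listwise deletion leaves a single complete participant, whereas each pairwise calculation still has two participants to work with. The trade-off is that different calculations are then based on different subsets of the sample.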
Missing values can also be imputed from observed data. Methods include last observation carried forward (used in longitudinal data, where a participant's previous observation is used to impute the missing value); single mean imputation (replacing a missing value with the mean of the values observed in other participants); and regression mean imputation (using a regression equation fitted to the observed data to predict the missing value). However, these methods often lead to bias and to underestimation of the standard error, as they do not account for the uncertainty about the missing values [1].
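A small sketch can illustrate two of these methods and why single mean imputation understates variability. The BMI-like values are simulated, and the helper name `locf` and all parameters are my own assumptions, not taken from the course materials.

```python
import random
import statistics


def locf(series):
    """Last observation carried forward for one participant's repeated
    measurements; leading missing values stay None."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


print(locf([27.1, None, None, 28.4, None]))

# Simulated BMI-like values with about 30% set to missing (MCAR).
rng = random.Random(42)
complete = [rng.gauss(27, 4) for _ in range(1000)]
observed = [x if rng.random() >= 0.3 else None for x in complete]

# Single mean imputation: every gap receives the same observed mean.
obs_values = [x for x in observed if x is not None]
mean_obs = statistics.mean(obs_values)
imputed = [x if x is not None else mean_obs for x in observed]

sd_observed = statistics.stdev(obs_values)
sd_imputed = statistics.stdev(imputed)
print(f"SD observed {sd_observed:.2f}, SD after mean imputation {sd_imputed:.2f}")
```

The mean is unchanged by the imputation, but the standard deviation shrinks because the imputed values add no spread while the sample size appears larger. Standard errors computed from the imputed dataset are therefore too small.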
The second day of the course focused on more complex methods of dealing with missing data, including single random imputation and multiple random imputation. These methods offer substantial improvements over the aforementioned ‘simpler’ methods of imputation, as they do not replace a missing value directly with an observed value or a summary of observed values; instead, they are based on the idea that the observed values define a distribution whose parameters can be estimated from the sample. Random draws from this distribution are then imputed. Consequently, the imputed values may differ from every observed value and from each other, which better preserves the variability of the data.
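The idea of imputing random draws can be sketched as follows. This is a deliberately simplified illustration under my own assumptions: the data are simulated, a normal distribution is assumed, and the helper `random_imputation` is hypothetical. A proper multiple imputation procedure would also reflect the uncertainty in the estimated mean and standard deviation, and would usually condition on other variables. The pooling step uses Rubin's rules, the standard way of combining estimates across imputed datasets, which the text above does not describe.

```python
import random
import statistics

rng = random.Random(7)

# Simulated BMI-like values with about 30% set to missing (MCAR).
complete = [rng.gauss(27, 4) for _ in range(1000)]
observed = [x if rng.random() >= 0.3 else None for x in complete]

# Estimate the distribution parameters from the observed values only.
obs_values = [x for x in observed if x is not None]
mu = statistics.mean(obs_values)
sigma = statistics.stdev(obs_values)


def random_imputation(values, mu, sigma, rng):
    """Fill each gap with an independent draw from Normal(mu, sigma)."""
    return [x if x is not None else rng.gauss(mu, sigma) for x in values]


# Single random imputation keeps the spread close to the observed spread.
single_sd = statistics.stdev(random_imputation(observed, mu, sigma, rng))

# Multiple random imputation: repeat m times and pool the estimates.
m = 20
estimates, variances = [], []
for _ in range(m):
    data = random_imputation(observed, mu, sigma, rng)
    estimates.append(statistics.mean(data))
    variances.append(statistics.variance(data) / len(data))

pooled_mean = statistics.mean(estimates)
within = statistics.mean(variances)        # average within-imputation variance
between = statistics.variance(estimates)   # between-imputation variance
total_variance = within + (1 + 1 / m) * between  # Rubin's rules

print(f"SD observed {sigma:.2f}, SD after random imputation {single_sd:.2f}")
print(f"pooled mean {pooled_mean:.2f}, pooled SE {total_variance ** 0.5:.3f}")
```

Because every imputed dataset contains different draws, the estimates vary between datasets. The between-imputation term carries that variation into the pooled standard error, which is how multiple imputation reflects the uncertainty about the missing values.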
This two-day course (for more information see www.ucl.ac.uk/stats-courses) came at just the right time for me. The main presenter, Chibueze, is a great teacher who is able to make complex issues easy to understand. He is also adept at making online training interactive: he involved us directly through a number of live quizzes and set up breakout rooms in which we could discuss cases with peers. This training has given me the additional knowledge on dealing with missing data that I need to complete my analysis plan.
Reference
1. Sterne JA, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ (Clinical research ed). 2009;338:b2393.