The impact of cohort study design on modelling change in an outcome: a simulation study

Howe, Laura

INTRODUCTION: Cohort studies generally collect repeated measurements of participant characteristics, facilitating research into the life course development of health and disease. We explored how alternative data collection schedules affect the ability to analyse repeated measures data. For example, is it preferable to have balanced or unbalanced data? Should all participants have the same number of measurements, or should a subset have very detailed data and the remainder less intensive data collection?

METHODS: We simulated data according to linear, curvilinear and U-shaped functions. We used observed cohort data to select realistic model parameters, and fractional polynomials to model the curves. After simulating 1,000 datasets, each of 1,000 participants with data at frequent intervals, we drew from these datasets to mimic several study designs and assessed bias and residual variance.
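
The simulate-then-subsample approach can be illustrated with a minimal sketch for the linear case. This is not the authors' code. The parameter values, the measurement ages 0 to 10, and the function names (simulate_cohort, balanced_design, unbalanced_design, mixed_design) are all assumptions made for illustration. Each person receives a random intercept and a correlated random slope, is measured at every occasion, and the full data are then thinned to mimic three designs.

```python
import random


def simulate_cohort(n_people=1000, ages=tuple(range(0, 11)), seed=1):
    """Simulate linear trajectories with a random intercept and slope per person.

    All parameter values below are hypothetical, chosen only for illustration.
    Returns one list of (person_id, age, outcome) rows per person.
    """
    rng = random.Random(seed)
    beta0, beta1 = 50.0, 2.0            # fixed intercept and slope
    sd_u0, sd_u1, rho = 3.0, 0.5, 0.3   # random-effect SDs and their correlation
    sd_e = 1.0                          # residual SD
    cohort = []
    for pid in range(n_people):
        # Build correlated random effects from two independent standard normals.
        z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
        u0 = sd_u0 * z0
        u1 = sd_u1 * (rho * z0 + (1 - rho ** 2) ** 0.5 * z1)
        rows = [
            (pid, t, beta0 + u0 + (beta1 + u1) * t + rng.gauss(0, sd_e))
            for t in ages
        ]
        cohort.append(rows)
    return cohort


def balanced_design(cohort, occasions=(0, 5, 10)):
    """Everyone is measured at the same fixed occasions."""
    return [[r for r in rows if r[1] in occasions] for rows in cohort]


def unbalanced_design(cohort, n_measures=2, seed=2):
    """Each person is measured at their own randomly chosen occasions."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(rows, n_measures), key=lambda r: r[1])
        for rows in cohort
    ]


def mixed_design(cohort, frac_frequent=0.1, n_sparse=2, seed=3):
    """A small subset keeps all measures; the remainder keep only a few."""
    rng = random.Random(seed)
    n_frequent = round(frac_frequent * len(cohort))
    frequent_ids = set(rng.sample(range(len(cohort)), n_frequent))
    design = []
    for pid, rows in enumerate(cohort):
        if pid in frequent_ids:
            design.append(list(rows))
        else:
            design.append(sorted(rng.sample(rows, n_sparse), key=lambda r: r[1]))
    return design
```

In a full simulation study, a random-slopes model would be fitted to each thinned dataset with a mixed-model library, and the estimates compared with the generating values across replicates. That fitting step is omitted from this sketch.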

RESULTS: For a linear model, most study designs (e.g. 3 balanced or unbalanced measures, or 5 or 10% of participants with frequent measures and 2 or 3 measures per person for the remainder) were unbiased with respect to fixed effects, random effects and correlations between random effects. Three measures per person would ordinarily be considered necessary to fit this linear model, but our results suggest that with unbalanced data it is possible to fit a random-slopes model with 2 measures per person, although the residual variance is underestimated. Having a small proportion of the cohort with frequent follow-up and the remainder with only a few measurements appeared to be a suitable strategy when modelling a linear process, but appeared problematic for curvilinear and U-shaped models.
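
One way to see why unbalanced timing can make 2 measures per person sufficient is to write out the random-slopes linear model and its implied covariances. The notation below is assumed for illustration and is not taken from the study: y_{ij} is the outcome for person i at occasion j, measured at time t_{ij}.

```latex
\begin{align*}
y_{ij} &= (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})\, t_{ij} + e_{ij}, \\
\begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix}
  &\sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
  \begin{pmatrix} \sigma_{u0}^2 & \sigma_{u01} \\ \sigma_{u01} & \sigma_{u1}^2 \end{pmatrix} \right),
\qquad e_{ij} \sim N(0, \sigma_e^2), \\
\operatorname{Var}(y_{ij}) &= \sigma_{u0}^2 + 2\, t_{ij}\, \sigma_{u01} + t_{ij}^2\, \sigma_{u1}^2 + \sigma_e^2, \\
\operatorname{Cov}(y_{ij}, y_{ik}) &= \sigma_{u0}^2 + (t_{ij} + t_{ik})\, \sigma_{u01} + t_{ij}\, t_{ik}\, \sigma_{u1}^2, \qquad j \neq k.
\end{align*}
```

With 2 measures taken at the same two times for everyone, the data supply only three distinct second moments (two variances and one covariance), which cannot separate the four variance parameters. When measurement times differ between people, each pair of times contributes its own variances and covariance, so the four parameters can in principle be separated.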

CONCLUSIONS: Our results imply that a random-slopes linear model can be estimated with just 2 unbalanced measures per person if the main interest lies with the fixed effects; this has implications for the cost-effectiveness of cohort study designs and the use of routine data in longitudinal studies, where unbalanced data is likely.