Missing Data Replacement Methods in Different Scenarios
Abstract
Objective: To compare the effects of different approaches to missing data replacement on the estimated regression coefficient r of “length of stay” on “hospital expenditure”.

Methods: Data were extracted from the medical records of patients with head and neck neoplasms admitted to Sichuan Cancer Hospital. R 3.4.1 was used to generate and process simulated datasets. Scenarios were constructed with the Monte Carlo method by varying the proportion of missing data and the missing-data mechanism. Three strategies for replacing missing data were tested: the complete case method, Expectation Maximization (EM), and the Markov Chain Monte Carlo (MCMC) method. For each strategy, the regression coefficient r of standardized “length of stay” on standardized log-transformed “hospital expenditure” was estimated and compared with that from the original complete dataset in terms of accuracy (the magnitude of the difference in r) and precision (the difference in the standard error of r).

Results: All three replacement methods gave acceptable estimates (within the limit r_c ± 0.5 s_c) when missing data were generated under the MAR (2∶1) mechanism, or when less than 30% of the data were simulated as missing under the MCAR and MAR (1∶2) mechanisms. The EM method had the best estimation precision.

Conclusion: Missing data replacement should take into account the proportion of missing data and the potential missing-data mechanisms involved.
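The simulation design described in the Methods can be illustrated with a minimal sketch. It is written in Python using only the standard library, not the R code used in the study, and it covers only the complete case strategy. Several things are assumptions made for illustration: the synthetic data and its true slope, the function names (`simulate`, `ols_slope`, `mcar_mask`, `mar_mask`, `complete_case_slope`, `within_limit`), the reading of the MAR ratio as "records with above-median predictor values are `ratio` times as likely to be missing as the rest", and the reading of r_c and s_c as the coefficient and standard error from the complete dataset.

```python
import math
import random
import statistics


def simulate(n, slope, rng):
    """Hypothetical standardized data: x ~ N(0, 1), y = slope * x + noise."""
    noise_sd = math.sqrt(1.0 - slope ** 2)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def ols_slope(xs, ys):
    """Simple linear regression of y on x: returns (slope, standard error)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    rss = sum((y - my - slope * (x - mx)) ** 2 for x, y in zip(xs, ys))
    return slope, math.sqrt(rss / (n - 2) / sxx)


def mcar_mask(n, prop, rng):
    """MCAR: every record is equally likely to have a missing outcome."""
    missing = set(rng.sample(range(n), round(prop * n)))
    return [i in missing for i in range(n)]


def mar_mask(xs, prop, ratio, rng):
    """MAR (assumed form): missingness depends only on the observed x.

    Records with x above the median are `ratio` times as likely to be
    missing as the others, with an overall expected proportion `prop`.
    """
    p_low = 2.0 * prop / (1.0 + ratio)
    p_high = ratio * p_low
    if p_high > 1.0:
        raise ValueError("prop and ratio imply a probability above 1")
    med = statistics.median(xs)
    return [rng.random() < (p_high if x > med else p_low) for x in xs]


def complete_case_slope(xs, ys, mask):
    """Complete case analysis: drop records whose outcome is missing."""
    kept = [(x, y) for x, y, m in zip(xs, ys, mask) if not m]
    return ols_slope([x for x, _ in kept], [y for _, y in kept])


def within_limit(r_est, r_c, s_c):
    """Acceptance rule read from the abstract: r_c - 0.5*s_c <= r <= r_c + 0.5*s_c."""
    return abs(r_est - r_c) <= 0.5 * s_c


if __name__ == "__main__":
    rng = random.Random(2024)
    xs, ys = simulate(2000, 0.6, rng)
    r_c, s_c = ols_slope(xs, ys)  # reference estimate from the full data
    for label, mask in [
        ("MCAR 30%", mcar_mask(len(xs), 0.3, rng)),
        ("MAR (2:1) 30%", mar_mask(xs, 0.3, 2.0, rng)),
    ]:
        r, s = complete_case_slope(xs, ys, mask)
        print(label, round(r, 3), round(s, 3), within_limit(r, r_c, s_c))
```

Repeating the loop over many seeds and missing proportions would give a Monte Carlo comparison in the spirit of the one summarized above. EM and MCMC imputation are left out because they need a dedicated statistical library.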