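QIU Jian-qing, ZHOU Yu-qiu, YUE Ting-yan, et al. Comparison of results from different missing-value handling methods under different missingness scenarios[J]. Journal of Sichuan University (Medical Sciences), 2018, 49(3): 430-435.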
QIU Jian-qing, ZHOU Yu-qiu, YUE Ting-yan, et al. Missing Data Replacement Methods in Different Scenarios[J]. Journal of Sichuan University (Medical Sciences), 2018, 49(3): 430-435.

Comparison of Results from Different Missing-Value Handling Methods under Different Missingness Scenarios

Missing Data Replacement Methods in Different Scenarios

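  • Abstract: Objective Using inpatient medical record data from patients with head and neck tumors at Sichuan Cancer Hospital, to assess, under different missing-data scenarios, the quality of the estimated regression coefficient r of standardized length of stay on the standardized logarithm of hospitalization cost after missing values are handled by three methods: the complete-case method, expectation-maximization (EM), and Markov chain Monte Carlo (MCMC). Methods Using R 3.4.1 and Monte Carlo simulation, datasets with missing values were simulated for different scenarios by setting the proportion of missing data and the missingness mechanism. The complete-case method, EM, and MCMC were each used to estimate r in the simulated datasets for every scenario, and the results were compared with the estimate r_c from the complete dataset. The methods were evaluated on accuracy (comparison of each method's estimated r with r_c) and precision (the variability s of each method's r). Results The relative performance of the three methods differed across missing-data scenarios. Under the missing completely at random (MCAR) and missing at random (MAR) (1∶2) mechanisms, the estimates r from all three methods were within the acceptable range (r_c ± 0.5 s_c) when the proportion of missing data was below 30%. Under the MAR (ratio = 2∶1) mechanism, the estimates r from all three methods were within the acceptable range at every proportion of missing data. In every scenario, the standard error s of r estimated by EM was the smallest and was closest to the standard error s_c of r_c. Conclusion The proportion of missing data and the missingness mechanism should be considered when choosing a method for handling missing values.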

     

    Abstract: Objective To compare how different methods of handling missing data affect the estimated regression coefficient r of “length of stay” on “hospital expenditure”. Methods Data were extracted from the medical records of patients with head and neck neoplasms admitted to Sichuan Cancer Hospital. R 3.4.1 was used to generate and process simulated datasets. Monte Carlo simulation was used to create missing-data scenarios by varying the proportion of missing data and the missingness mechanism. Three strategies for handling missing data were tested: the complete-case method, expectation-maximization (EM), and Markov chain Monte Carlo (MCMC). The regression coefficient estimate r of standardized “length of stay” on standardized log-transformed “hospital expenditure” was calculated under each strategy and compared with the estimate r_c from the original complete dataset, in terms of accuracy (the difference between r and r_c) and precision (the standard error s of r). Results All three methods gave acceptable estimates (within r_c ± 0.5 s_c) when data were missing under the missing at random (MAR) (2∶1) mechanism, or when less than 30% of the data were missing under the missing completely at random (MCAR) and MAR (1∶2) mechanisms. The EM method had the best precision. Conclusion The choice of method for handling missing data should take into account the proportion of missing data and the missingness mechanism.
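The simulation design described in the abstract can be illustrated with a minimal sketch, shown here only for the complete-case method. It is written in Python rather than the R 3.4.1 used in the study, and everything specific in it is an assumption for illustration: the synthetic data, the function names `slope_and_se`, `delete_y`, and `simulate`, the way MAR is weighted (records with above-median x are twice as likely to lose y), and applying the r_c ± 0.5 s_c criterion to the mean estimate across replicates. It does not reproduce the paper's data or its exact MAR (1∶2) and MAR (2∶1) definitions.

```python
import math
import random
import statistics


def slope_and_se(x, y):
    """OLS slope of y on x and its standard error (simple linear regression)."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return slope, math.sqrt(rss / (n - 2) / sxx)


def delete_y(x, y, proportion, mechanism, rng):
    """Drop y for `proportion` of the records and return the complete cases.

    MCAR: every record is equally likely to lose y.
    MAR (hypothetical weighting): records with x above the median are
    twice as likely to lose y, so missingness depends only on observed x.
    """
    n = len(x)
    n_missing = round(proportion * n)
    if mechanism == "MCAR":
        weights = [1.0] * n
    elif mechanism == "MAR":
        med = statistics.median(x)
        weights = [2.0 if a > med else 1.0 for a in x]
    else:
        raise ValueError("mechanism must be 'MCAR' or 'MAR'")
    # Weighted sampling without replacement: largest keys u**(1/w) are chosen.
    keys = [rng.random() ** (1.0 / w) for w in weights]
    order = sorted(range(n), key=lambda i: keys[i], reverse=True)
    missing = set(order[:n_missing])
    kept = [i for i in range(n) if i not in missing]
    return [x[i] for i in kept], [y[i] for i in kept]


def simulate(n=500, reps=200, proportion=0.2, mechanism="MCAR", seed=1):
    """Compare complete-case estimates of r with r_c from the full data."""
    rng = random.Random(seed)
    # Synthetic stand-ins for standardized stay and log cost (assumed values).
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [0.6 * a + rng.gauss(0, 0.8) for a in x]
    r_c, s_c = slope_and_se(x, y)
    estimates = []
    n_complete = n
    for _ in range(reps):
        xs, ys = delete_y(x, y, proportion, mechanism, rng)
        n_complete = len(xs)
        estimates.append(slope_and_se(xs, ys)[0])
    r_mean = statistics.fmean(estimates)
    return {
        "r_c": r_c,
        "s_c": s_c,
        "r_mean": r_mean,
        "n_complete": n_complete,
        # Acceptability rule from the abstract, applied to the mean estimate.
        "acceptable": abs(r_mean - r_c) <= 0.5 * s_c,
    }


if __name__ == "__main__":
    for mech in ("MCAR", "MAR"):
        for p in (0.1, 0.3, 0.5):
            print(mech, p, simulate(proportion=p, mechanism=mech))
```

Swapping the complete-case step for an EM or MCMC imputation routine would extend the sketch to the other two methods, and the same acceptability check could then be applied to each.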

     
