Objective To identify lifestyle-related risk factors for incident lung cancer, to build a lung cancer risk prediction model that identifies high-risk individuals in the population, and thereby to facilitate the early detection of lung cancer.
Methods Data were obtained from the UK Biobank, a database containing information collected from 502389 participants between March 2006 and October 2010. Criteria for identifying the high-risk population were determined from domestic and international lung cancer screening guidelines and from high-quality research literature on lung cancer risk factors. Univariate Cox regression was used to screen for risk factors of lung cancer, and a multivariable lung cancer risk prediction model was constructed using Cox proportional hazards regression, which accounts for survival time. The best-fitting model that satisfied the proportional hazards assumption was selected by comparing Akaike information criterion values and Schoenfeld residual test results. The population was randomly divided into a training set and a validation set at a ratio of 7:3. The model was built on the training set and internally validated on the validation set. The area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate the discriminatory performance of the model. The population was categorized into low-risk, moderate-risk, and high-risk groups according to a probability of occurrence of 0% to <25%, 25% to <75%, and 75% to 100%, respectively, and the proportion of affected individuals in each risk group was calculated.
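As a minimal sketch of the 7:3 random split and the three-tier risk grouping described in the Methods, the following Python uses only the standard library. The function names (`split_7_3`, `risk_group`, `case_share_by_group`), the fixed seed, and the toy records are hypothetical and are not taken from the study. The sketch also assumes that the cut-offs apply directly to a predicted probability and that the quantity of interest is the share of all cases falling in each tier; the abstract does not spell out either point.

```python
import random
from collections import Counter


def split_7_3(ids, seed=0):
    """Randomly split ids into a training list (70%) and a validation list (30%)."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def risk_group(p):
    """Map a predicted probability in [0, 1] to a risk tier.

    Assumed cut-offs: [0, 0.25) low, [0.25, 0.75) moderate, [0.75, 1] high.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p < 0.25:
        return "low"
    if p < 0.75:
        return "moderate"
    return "high"


def case_share_by_group(records):
    """Return the share of all observed cases that falls in each risk tier.

    records: iterable of (predicted_probability, developed_cancer) pairs.
    """
    cases = Counter(risk_group(p) for p, event in records if event)
    total = sum(cases.values())
    return {
        tier: (cases[tier] / total if total else 0.0)
        for tier in ("low", "moderate", "high")
    }


# Toy data (hypothetical, for illustration only).
train, valid = split_7_3(range(10))
toy = [(0.10, False), (0.30, True), (0.80, True), (0.90, True), (0.20, False)]
shares = case_share_by_group(toy)
```

In this toy example, two of the three cases have a predicted probability of at least 0.75, so the high-risk tier captures two thirds of the cases. In the study itself the predicted probabilities would come from the fitted Cox model rather than from hand-written values.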
Results The final analysis included 453558 individuals. Over a cumulative follow-up of 5505402 person-years, 2330 cases of lung cancer were diagnosed. Cox proportional hazards regression identified 10 independent variables as predictors of lung cancer: age, body mass index (BMI), education, income, physical activity, smoking status, alcohol consumption frequency, fresh fruit intake, family history of cancer, and tobacco exposure. A model was established from these variables. Internal validation showed that 8 of the 10 variables (all except BMI and fresh fruit intake) were significant influencing factors of lung cancer (P<0.05). In the training set, the AUCs for predicting lung cancer occurrence at one, five, and ten years were 0.825, 0.785, and 0.777, respectively. In the validation set, the corresponding AUCs were 0.857, 0.782, and 0.765. Screening the high-risk population could identify 68.38% of the individuals who might develop lung cancer in the future.
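The horizon-specific AUCs reported in the Results can be illustrated with a simplified calculation. The sketch below treats individuals diagnosed by the horizon as cases and those still under follow-up after the horizon as controls, drops anyone censored before the horizon, and computes the share of case-control pairs in which the case has the higher risk score (ties count one half). This is a simplification under stated assumptions: the abstract does not say which time-dependent AUC estimator was used, and the function name `auc_at_horizon` and the toy data are hypothetical.

```python
def auc_at_horizon(times, events, scores, horizon):
    """Simplified cumulative/dynamic AUC at a given follow-up horizon.

    times   : follow-up time per individual (years)
    events  : True if lung cancer was diagnosed at that time, False if censored
    scores  : model risk score (higher means higher predicted risk)
    horizon : prediction horizon in years (for example 1, 5, or 10)
    """
    cases, controls = [], []
    for t, e, s in zip(times, events, scores):
        if e and t <= horizon:
            cases.append(s)      # diagnosed by the horizon
        elif t > horizon:
            controls.append(s)   # known to be event-free at the horizon
        # Censored before the horizon: status unknown, so excluded here.
        # (No inverse-probability-of-censoring weighting is applied.)

    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    concordant = 0.0
    for c in cases:              # O(n^2) pair loop, fine for a small example
        for k in controls:
            if c > k:
                concordant += 1.0
            elif c == k:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


# Toy data (hypothetical, for illustration only).
times = [0.5, 3.0, 4.0, 8.0, 12.0, 12.0]
events = [True, True, False, True, False, False]
scores = [0.9, 0.4, 0.6, 0.7, 0.5, 0.2]

auc_1y = auc_at_horizon(times, events, scores, horizon=1)
auc_5y = auc_at_horizon(times, events, scores, horizon=5)
```

Because the case and control sets change with the horizon, the same risk scores can yield different AUCs at one, five, and ten years, which is consistent with the pattern of values reported in the Results.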
Conclusion In this study, we established a model that predicts lung cancer risk from lifestyle behaviors in a large population. The model shows good discriminatory ability and can be used as a tool for developing standardized screening strategies for lung cancer.