Abstract
Background
Outliers can influence regression model parameters and change the direction of the estimated effect, leading to over- or under-estimation of the strength of the association between a response variable and an exposure of interest. Identifying visit-level outliers in longitudinal data with continuous time-dependent covariates is important when the distribution of such a variable is highly skewed.
Objectives
The primary objective was to identify potential outliers at follow-up visits using the interquartile range (IQR) statistic and to assess their influence on estimated Cox regression parameters.
Methods
The study was motivated by large longitudinal dietary and time-to-event data from the TEDDY study, with continuous time-varying vitamin B12 intake as the exposure of interest and development of islet autoimmunity (IA) as the response variable. An IQR algorithm was applied to the TEDDY dataset to detect potential outliers at each visit. To assess the impact of detected outliers, data were analyzed using the extended time-dependent Cox model with a robust sandwich estimator. Partial residual diagnostic plots were examined for highly influential outliers.
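As an illustration of visit-level IQR screening, the following is a minimal sketch and not the authors' implementation. It assumes the conventional Tukey fences, Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, computed separately within each visit; the abstract does not state the multiplier or any transformation used. The function name `flag_iqr_outliers`, the `(subject_id, visit, value)` record layout, and the parameter `k` are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def flag_iqr_outliers(records, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] within each visit.

    records: iterable of (subject_id, visit, value) tuples (assumed layout).
    k: fence multiplier; 1.5 is the usual convention, assumed here.
    Returns a set of (subject_id, visit) pairs flagged as potential outliers.
    """
    # Group observations by visit so quartiles are visit-specific.
    by_visit = defaultdict(list)
    for subject_id, visit, value in records:
        by_visit[visit].append((subject_id, value))

    flagged = set()
    for visit, rows in by_visit.items():
        values = [value for _, value in rows]
        if len(values) < 4:
            continue  # too few observations to estimate quartiles sensibly
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        for subject_id, value in rows:
            if value < lower or value > upper:
                flagged.add((subject_id, visit))
    return flagged


# Toy data: one extreme intake value at visit 1, none at visit 2.
records = [
    ("a", 1, 1.0), ("b", 1, 2.0), ("c", 1, 3.0),
    ("d", 1, 4.0), ("e", 1, 5.0), ("f", 1, 100.0),
    ("a", 2, 2.0), ("b", 2, 3.0), ("c", 2, 4.0),
    ("d", 2, 5.0), ("e", 2, 6.0),
]
print(flag_iqr_outliers(records))
```

Flagged pairs are candidates for further investigation rather than automatic exclusions; their influence could then be checked by refitting the Cox model with and without them.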
Results
Extreme vitamin B12 observations from participants who developed IA had a stronger influence on the Cox regression model than those from non-cases. Identified outliers changed the direction of the hazard ratios, the standard errors, or the strength of the association with the risk of developing IA.
Conclusion
At the exploratory data analysis stage, the IQR algorithm can be used as a data quality control tool to identify potential outliers at the visit level, which can be further investigated.
Data availability
Data from The Environmental Determinants of Diabetes in the Young (https://doi.org/10.58020/y3jk-x087) reported here will be made available for request at the NIDDK Central Repository (NIDDK-CR) website, Resources for Research (R4R), https://repository.niddk.nih.gov/.
Code availability
Statistical analysis code can be provided by the corresponding author upon reasonable request.
Acknowledgements
The authors would like to thank Sarah Austin-Gonzalez of the University of South Florida (USF) Health Informatics Institute for editing and for providing study information and support. We would also like to thank the reviewers for helping us improve the manuscript.
Funding
The TEDDY Study is funded by U01 DK63829, U01 DK63861, U01 DK63821, U01 DK63865, U01 DK63863, U01 DK63836, U01 DK63790, UC4 DK63829, UC4 DK63861, UC4 DK63821, UC4 DK63865, UC4 DK63863, UC4 DK63836, UC4 DK95300, UC4 DK100238, UC4 DK106955, UC4 DK112243, UC4 DK117483, U01 DK124166, U01 DK128847, and Contract No. HHSN267200700014C from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of Allergy and Infectious Diseases (NIAID), Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), National Institute of Environmental Health Sciences (NIEHS), Centers for Disease Control and Prevention (CDC), and JDRF. This work is supported in part by the NIH/NCATS Clinical and Translational Science Awards to the University of Florida (UL1 TR000064) and the University of Colorado (UL1 TR002535). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Author information
Contributions
Conceptualization: LKM and JY. Methodology: LKM, KFL, and XL. Software: LKM. Formal analysis: LKM. Resources: JPK. Data curation: LKM, JY, and UMU. Writing—original draft preparation: LKM. Writing—review and editing: LKM, XL, KFL, JY, CAA, SH, JMN, SMV, LH, UMU, and JPK. Supervision: KFL, XL, and JPK. Project administration: UMU and JMN. Funding acquisition: JMN, UMU, and JPK. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
The TEDDY study was conducted in accordance with the Declaration of Helsinki, and approved by local US Institutional Review Boards and European Ethics Committee Boards, including the Colorado Multiple Institutional Review Board, Medical College of Georgia Human Assurance Committee (2004–2010), Georgia Health Sciences University Human Assurance Committee (2011–2012), Georgia Regents University Institutional Review Board (2013–2015), Augusta University Institutional Review Board (2015–present), University of Florida Health Center Institutional Review Board, Washington State Institutional Review Board (2004–2012), Western Institutional Review Board (2013–present), Ethics Committee of the Hospital District of Southwest Finland, Bayerischen Landesärztekammer (Bavarian Medical Association) Ethics Committee, Regional Ethics Board in Lund, Section 2 (2004–2012), and Lund University Committee for Continuing Ethical Review (2013–present).
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mramba, L.K., Liu, X., Lynch, K.F. et al. Detecting potential outliers in longitudinal data with time-dependent covariates. Eur J Clin Nutr 78, 344–350 (2024). https://doi.org/10.1038/s41430-023-01393-6
DOI: https://doi.org/10.1038/s41430-023-01393-6