Exploring the Reliability of an Assessment based on Automatic Item Generation Using the Multivariate Generalizability Theory

  • Received : 2023.06.26
  • Accepted : 2023.08.05
  • Published : 2023.08.31

Abstract

The purpose of this study is to show how to estimate the reliability of an assessment composed of items produced by automatic item generation, using empirical example data. To this end, we analyzed illustrative assessment data with multivariate generalizability theory, which can accommodate a design in which each student responds to a different set of items and can model multiple sources of error in the assessment scores. The G-study showed that, in most designs, the student effect, which corresponds to the true score in classical test theory, was the largest effect after the residual. In addition, in the designs in which the content domain was fixed, the relative ranking of students did not change across item types or items; similarly, in the designs in which the item format was fixed, item difficulty varied little across content domains. The D-study indicated that the original assessment achieved a sufficient level of reliability, and that reliability higher than that of the original assessment could be obtained by reducing the number of items in the number and operation, geometry, and probability and statistics domains, or by assigning higher weights to the letters and expressions, and functions domains. The efficient measurement conditions presented here are limited to the illustrative assessment data, but the method applied in this study can be used to estimate reliability and to identify efficient measurement conditions, on the basis of psychometric properties, in a wide range of assessment situations that use automatic item generation.
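
For readers who want to see what the D-study weight and item-count manipulations act on, the composite reliability can be written in standard multivariate generalizability notation. The sketch below is generic, not the study's actual estimates: it assumes V fixed content domains with error terms uncorrelated across domains, and the exact composition of each domain's error variance depends on the design actually fitted (for example, items nested within students when each student answers different generated items).

\[
E\rho^{2}_{C}
  = \frac{\sigma^{2}(\tau_{C})}{\sigma^{2}(\tau_{C}) + \sigma^{2}(\delta_{C})},
\qquad
\sigma^{2}(\tau_{C}) = \sum_{v=1}^{V}\sum_{v'=1}^{V} w_{v}\, w_{v'}\, \sigma_{vv'}(p),
\qquad
\sigma^{2}(\delta_{C}) = \sum_{v=1}^{V} w_{v}^{2}\, \sigma^{2}(\delta_{v})
\]

Here \(w_{v}\) is the nominal weight assigned to domain \(v\), \(\sigma_{vv'}(p)\) is the universe-score (co)variance of students across domains \(v\) and \(v'\), and \(\sigma^{2}(\delta_{v})\) is the relative error variance for domain \(v\), which shrinks as more items are sampled from that domain (in a simple crossed design, \(\sigma^{2}(\delta_{v}) = \sigma^{2}(pi_{v})/n_{v}\)). This makes the two D-study findings above mechanical: removing items from a domain that contributes little unique error barely inflates \(\sigma^{2}(\delta_{C})\), while shifting weight toward domains with large universe-score variance and small error variance raises \(E\rho^{2}_{C}\).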

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) (No. 2019R1F1A1059437, No. 2022R1A2C1010310).
