Development of a Large-scale Korean Language Model in the Field of Geosciences

  • Sang-ho Lee (Mineral Resources Division, Korea Institute of Geosciences and Mineral Resources)
  • Received : 2024.08.30
  • Accepted : 2024.10.10
  • Published : 2024.10.29

Abstract

With the rapid development and commercialization of large-scale generative language models, concerns have emerged regarding the appropriateness of model outputs, domain expertise, and data security. In particular, no Korean generative language model specialized in the field of geoscience has yet been studied, owing to difficulties in data processing and preprocessing and a lack of prior development cases. This study carried out the entire process of developing a Korean language model specialized in geoscience and evaluated its applicability in related fields. To this end, academic materials related to geoscience were collected and preprocessed to build a dataset suitable for training the language model, and the dataset was then used to pretrain and fine-tune the Llama 2 model. The trained model was quantitatively evaluated using 19 evaluation datasets covering various fields. The results showed improved performance in scientific question answering and Korean text interpretation compared to the original model. The language model developed in this study can enhance research productivity in geoscience, for example by supporting idea generation, and its outcomes are expected to stimulate further research on and utilization of generative language models in geoscience.

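The abstract does not include implementation details. As a rough illustration of the kind of pipeline it describes (adapting a Llama 2 base model to a domain corpus with low-rank adapters, cf. Hu et al., 2021; Touvron et al., 2023), the sketch below assumes the Hugging Face transformers, peft, and datasets libraries, an illustrative base checkpoint, and a hypothetical preprocessed corpus file; it is not the authors' actual training setup.

```python
# Hypothetical sketch: LoRA fine-tuning of a Llama 2 base model on a
# domain text corpus, in the spirit of the pipeline described in the abstract.
# Assumes: transformers, peft, datasets, and a local JSONL corpus
# ("geoscience_corpus.jsonl" with a "text" field) -- all illustrative names.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # illustrative base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token by default

model = AutoModelForCausalLM.from_pretrained(base_model)

# Low-rank adapters (LoRA) train a small number of extra weights
# while the original model weights stay frozen.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Tokenize the (hypothetical) preprocessed geoscience corpus.
dataset = load_dataset("json", data_files="geoscience_corpus.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-geo-ko", num_train_epochs=1,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, fp16=True),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```

The study itself performed both continued pretraining and fine-tuning and then scored the result against 19 evaluation datasets; the snippet above only mirrors the adapter-based training step under the stated assumptions.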

Keywords

Acknowledgement

This research was conducted as part of the in-house research project of the Korea Institute of Geosciences and Mineral Resources, "Pilot Development of a Large-scale Language Model in the Geoscience and Mineral Resources Field (23-7512)".

References

  1. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I. and Amodei, D. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems, v.33, p.1877-1901. doi: 10.48550/arXiv.2005.14165
  2. Deng, C., Zhang, T., He, Z., Xu, Y., Chen, Q., Shi, Y., Fu, L., Zhang, W., Wang, X., Zhou, C., Lin, Z. and He, J. (2024, March) K2: Learning a foundation language model for geoscience knowledge understanding and utilization. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, Association for Computing Machinery, p.161-170. doi: 10.1145/3616855.3635772
  3. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W. (2021) LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685. doi: 10.48550/arXiv.2106.09685
  4. Lawley, C.J., Raimondo, S., Chen, T., Brin, L., Zakharov, A., Kur, D., Hui, J., Newton, G., Burgoyne, S.L. and Marquis, G. (2022) Geoscience language models and their intrinsic evaluation. Applied Computing and Geosciences, v.14, 100084. doi: 10.1016/j.acags.2022.100084 
  5. Lee, J. and Choi, T. (2023) Llama-2-KoEn-13B. doi: 10.57967/hf/1280
  6. Lee, A.N., Hunter, C.J. and Ruiz, N. (2023) Platypus: Quick, cheap, and powerful refinement of LLMs. arXiv preprint arXiv:2308.07317. doi: 10.48550/arXiv.2308.07317
  7. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S. and Kiela, D. (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, v.33, p.9459-9474. doi: 10.48550/arXiv.2005.11401
  8. Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H. and Awadallah, A. (2023) Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707. doi: 10.48550/arXiv.2306.02707
  9. Nasr, M., Carlini, N., Hayase, J., Jagielski, M., Cooper, A.F., Ippolito, D., Choquette-Choo, C.A., Wallace, E., Tramèr, F. and Lee, K. (2023) Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035. doi: 10.48550/arXiv.2311.17035
  10. Niederfahrenhorst, A., Hakhamaneshi, K. and Ahmad, R. (2023, September 6) Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama 2. Anyscale. https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-indepth-analysis-with-llama-2 
  11. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J. and Lowe, R. (2022) Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems 35 (NeurIPS 2022), Curran Associates, Inc., v.35, p.27730-27744. doi: 10.48550/arXiv.2203.02155
  12. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C.D. and Finn, C. (2023) Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290. doi: 10.48550/arXiv.2305.18290 
  13. Rasley, J., Rajbhandari, S., Ruwase, O. and He, Y. (2020, August) DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, p.3505-3506. doi: 10.1145/3394486.3406703
  14. Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Le Scao, T., Raja, A., Dey, M., Bari, M. S., Xu, C., Thakker, U., Sharma, S.S., Szczechla, E., Kim, T., Chhablani, G., Nayak, N., Datta, D., Chang, J., Jiang, M.T., Wang, H., Manica, M., Shen, S., Yong, Z.X., Pandey, H., Bawden, R., Wang, T., Neeraj, T., Rozen, J., Sharma, A., Santilli, A., Fevry, T., Fries, J.A., Teehan, R., Bers, T., Biderman, S., Gao, L., Wolf, T. and Rush, A.M. (2021) Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207. doi: 10.48550/arXiv.2110.08207 
  15. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C.C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P.S., Lachaux, M., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.M., Subramanian, R., Tan, X.E., Tang, B., Taylor, R., Williams, A., Kuan, J.X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S. and Scialom, T. (2023) Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. doi: 10.48550/arXiv.2307.09288 
  16. Wang, Y.E., Wei, G.Y. and Brooks, D. (2019) Benchmarking TPU, GPU, and CPU platforms for deep learning. arXiv preprint arXiv:1907.10701. doi: 10.48550/arXiv.1907.10701 
  17. Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J. and Wen, J. (2023) A survey of large language models. arXiv preprint arXiv:2303.18223. doi: 10.48550/arXiv.2303.18223