Development of Supervised Machine Learning based Catalog Entry Classification and Recommendation System

  • Received : 2018.08.30
  • Accepted : 2018.12.01
  • Published : 2019.02.28

Abstract

Domeggook, a B2B online shopping mall, holds a market share of over 70% with more than 2 million members, and 800,000 items are sold per day. However, because identical or similar items are registered under different catalog entries, buyers have difficulty searching for items, and managing a large B2B shopping mall also becomes problematic. In this study, we therefore developed a system that automatically classifies products and recommends catalog entries, using a supervised machine learning method trained on the large volume of existing shopping mall purchase information. Specifically, when a seller enters item registration information in natural language, the system performs morphological analysis with KoNLPy and applies Naïve Bayes classification to automatically recommend the most suitable catalog entries for the item. As a result, more accurate catalog entries could be built efficiently, which improved both the search speed and the total sales of the shopping mall.

1. Introduction

With the development of the Internet, online shopping malls are now heavily used for B2B transactions. In particular, the wholesale site of GNG Commerce Co., Ltd. [1], to which the results of this study are applied, is a B2B online shopping mall with over 2 million registered members. However, existing B2B web sites provide an interface in which users manually select and register the catalog entry classification for each product. As a result, these passive systems are inconvenient for users (B2B buyers), and identical or similar items are registered inappropriately under different catalog entries. We therefore set out to develop an automated catalog recommendation system that solves these problems and improves the accuracy of the product catalog and the efficiency of searching for goods in the online shopping mall, and consequently increases the total sales of the B2B mall.

In this paper, we give a brief introduction to supervised machine learning for system development based on Python [2], and we describe the process of analyzing and extracting meaningful information from the item name entered by the user (B2B seller) with a KoNLPy [3] morpheme analyzer, using natural language processing as “semi-supervised machine learning.” We then propose a process for classifying categories by learning from data with the Naïve Bayes classification [4, 5] function. In addition, the preprocessing method applied directly to the actual B2B online shopping mall site and the structure of the automatic classification algorithm based on Naïve Bayes are described in detail, and the experimental results are presented.

2. Machine Learning-based Catalog Entry Auto-Recommendation

2.1 Catalog Entry Management and its Problems

Currently, large-scale B2B online shopping malls such as that of GNG Commerce Co., Ltd. rely on a manual process for selecting and registering categories, as shown in Figure 1. The Domeggook.com site of GNG Commerce Co., Ltd. is the No. 1 B2B online wholesale shopping mall in the Republic of Korea (online wholesale market share: 70%). It has over 2 million registered members, and 800,000 items are sold per day.

(Figure 1) Domeggook B2B Shopping Mall & its Passive Catalog Entry Registration Process

On the Domeggook shopping mall, each seller enrolls their own items for sale, and these items are stored in Domeggook's online catalog DB. Buyers then search for items based on this catalog.

(Figure 2) Existing Item Enrollment Process on Domeggook B2B Shopping Mall from B2B Seller & Search Item Process for B2B Buyer

However, similar items are enrolled under different catalog entries, which makes it difficult for the Domeggook administrator to manage the catalog system. For example, exactly the same item was enrolled under different catalog entries (the Back pack/Leisure and Sack/Computer catalog entries, as shown in Figure 3). This also makes it difficult for buyers to find the appropriate item.

(Figure 3) Existing Problems on Item Enrollment and Search Process on Domeggook B2B Shopping Mall

In the existing systems [7, 8, 9, 10], catalog entries are selected passively (manually), so a product may be enrolled in an inappropriate catalog entry or re-registered under a catalog entry in which it was already registered. It is therefore necessary to develop an advanced system that, when a B2B merchant (seller) registers an item in the item catalog entry DB of the online shopping mall, classifies the product name and recommends the most suitable catalog entry based on the list of existing categories.

2.2 Catalog Entry Recommendation with SML

As a solution, we developed a catalog entry auto-classification and recommendation system using Supervised Machine Learning (SML) [6, 7]. When a seller (B2B merchant) inputs an item's name in natural language, the proposed system automatically recommends a set of appropriate catalog entries so that the item is enrolled appropriately in the item DB.

If the catalog entry auto-recommendation mechanism is implemented, identical and similar items can be registered under the exact and appropriate catalog entry (advantage for the seller). It then becomes much faster and easier to find the wanted items in the wholesale online shopping mall (advantage for the buyer). Additionally, the B2B admin can maintain the catalog system efficiently (advantage for the B2B site administrator).

(Figure 4) Suggested Catalog Entry Recommendation System on Wholesale B2B Shopping Mall

For this purpose, the product (item) names associated with the existing catalog entries must be pre-processed directly when developing a classification and recommendation system. To overcome the problems of catalog entry classification and recommendation in a large wholesale online shopping mall, we use a machine learning method that efficiently classifies categories based on the existing registered data and automatically recommends the most suitable catalog entry. Specifically, we developed an efficient automatic catalog entry recommendation system using a supervised machine learning model.

3. Natural Language Processing and Morphological Analysis for SML

3.1 Semi-Supervised Machine Learning Model

Machine learning is divided into supervised learning, unsupervised learning, and reinforcement learning models. To apply machine learning to the automatic recommendation of catalog entry information for the large number of items sold in the B2B online shopping mall, this study uses the supervised learning model.

The supervised learning model is mainly used for prediction. Basically, a machine learning algorithm learns from training data and then makes predictions for new data based on what it has learned. Semi-supervised learning is a class of machine learning tasks and techniques that also make use of unlabeled data for training, typically a small amount of labeled data together with a large amount of unlabeled data. Semi-supervised learning [11, 12, 13, 14] falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Acquiring labeled data for a learning problem often requires a skilled human agent. The cost of the labeling process may thus render a fully labeled training set infeasible, whereas acquiring unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning.

(Figure 5) Supervised Machine Learning Model

3.2 Morphological Analysis on Catalog Entry

The user registers an article in the online shopping mall under an article name written in natural language. Therefore, natural language processing is required as a preprocessing step for the items registered in the large online shopping mall. Natural language processing generally refers to a set of techniques for analyzing, extracting, and understanding meaningful information in text. In the case of English, words can be separated on white space, and further analysis is easy because exception handling is needed only for a limited number of cases such as "Can’t" or "Didn’t".

However, Korean is more difficult to process than English because postpositional particles (josa) and word endings (eomi) are attached to words. Therefore, for accurate and efficient processing of Korean information, natural language processing was performed through part-of-speech-level morphological analysis using the Twitter analyzer of KoNLPy, a Python package [3].

(Figure 6) Example of Korean Morpheme Analysis

4. KoNLPy and Naïve Bayes Classifier

4.1 KoNLPy [3] and Twitter Analyzer

KoNLPy is open source software released under the GPL v3 license. As the name implies, it is a Python package, and it contains various analyzers (Hannanum, Kkma, Komoran, Mecab, Twitter). A more accurate morphological analysis result can therefore be obtained by choosing the analyzer according to the characteristics of the input values.

The Twitter analyzer, one of the analyzers included in KoNLPy, provides precise identification of parts of speech. Because article names registered in the large-capacity online shopping mall typically carry additional product information, we use the Twitter analyzer to remove everything except nouns and English words. In this way, we could extract the noun information that forms the core of the product name.

(Figure 7) Examples of Natural Language Analyzers included in KoNLPy and Twitter
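
The noun-and-English filtering described above can be illustrated with a minimal sketch. It assumes the tagger returns (token, tag) pairs, as a KoNLPy `pos()` call does, and that the tags for nouns and alphabetic tokens are named "Noun" and "Alpha"; the helper name `extract_keywords` and the hand-written sample are hypothetical and are not taken from the implemented system.

```python
# Hypothetical helper: keep only nouns and English (alphabetic) tokens.
# `tagged` mimics the (token, tag) pairs returned by a KoNLPy tagger's pos();
# the tag names "Noun" and "Alpha" are assumed from the Twitter analyzer's tag set.
KEEP_TAGS = {"Noun", "Alpha"}


def extract_keywords(tagged):
    """Return the tokens whose part-of-speech tag is in KEEP_TAGS, in input order."""
    return [token for token, tag in tagged if tag in KEEP_TAGS]


# Hand-written sample standing in for tagger output (not produced by KoNLPy here).
sample = [
    ("접이식", "Noun"),
    ("컴퓨터", "Noun"),
    ("책상", "Noun"),
    ("을", "Josa"),
    ("특가", "Noun"),
    ("!", "Punctuation"),
    ("USB", "Alpha"),
]

keywords = extract_keywords(sample)
print(keywords)
```

With KoNLPy installed, the `sample` list would presumably be replaced by the output of the analyzer's `pos()` method applied to the product name; that step is omitted here so that the sketch runs without a Java runtime.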

4.2 Naive Bayes Classifier [4]

Naive Bayes is a machine learning classification algorithm that can be used for document classification or keyword search. It is a probabilistic classifier that applies Bayes' theorem to conditional probabilities and classifies the input data under the assumption that the elements constituting the document or data occur independently of each other. Because it is a supervised machine learning algorithm, a Naive Bayes classifier must first be trained on labeled data.

In this study, we applied the Naive Bayes classifier to catalog entry classification and improved the accuracy of the classification and recommendation results regardless of the number of data items included in each catalog entry.

\(\begin{aligned} \log p\left(C_{k} | \mathbf{x}\right) & \propto \log \left(p\left(C_{k}\right) \prod_{i=1}^{n} p_{k i}^{x_{i}}\right) \\ &=\log p\left(C_{k}\right)+\sum_{i=1}^{n} x_{i} \cdot \log p_{k i} \\ &=b+\mathbf{w}_{k}^{\top} \mathbf{x} \end{aligned}\)
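
As a rough illustration of the log-linear form above, the following sketch scores catalog entries for a list of keywords. It reads x_i as the number of times word i occurs in the item name and p_ki as the probability of word i in catalog entry C_k (an interpretation, since the symbols are not defined in the text). The add-one smoothing, the function names, and the toy training data are assumptions made for this example and are not the implemented system.

```python
import math
from collections import Counter, defaultdict


def train(samples, alpha=1.0):
    """samples: list of (category, [words]). Returns (log_prior, log_like, vocab)."""
    doc_count = Counter()
    word_count = defaultdict(Counter)
    vocab = set()
    for category, words in samples:
        doc_count[category] += 1
        word_count[category].update(words)
        vocab.update(words)
    total_docs = sum(doc_count.values())
    # log p(C_k): share of training items registered under each category
    log_prior = {c: math.log(n / total_docs) for c, n in doc_count.items()}
    log_like = {}
    for c in doc_count:
        # add-alpha smoothing (assumed) so that unseen words never give log(0)
        denom = sum(word_count[c].values()) + alpha * len(vocab)
        log_like[c] = {
            w: math.log((word_count[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_like, vocab


def score(words, log_prior, log_like, vocab):
    """log p(C_k) + sum_i x_i * log p_ki per category; unknown words are skipped."""
    x = Counter(w for w in words if w in vocab)
    return {
        c: log_prior[c] + sum(n * log_like[c][w] for w, n in x.items())
        for c in log_prior
    }


def recommend(words, model, k=3):
    """Return the k categories with the highest log-score."""
    scores = score(words, *model)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy training data (hypothetical, not from the Domeggook catalog).
toy = [
    ("computer desk", ["computer", "desk", "folding"]),
    ("sitting desk", ["sitting", "desk", "low"]),
    ("backpack", ["backpack", "travel", "bag"]),
    ("backpack", ["military", "backpack"]),
]
model = train(toy)
top = recommend(["folding", "computer", "desk"], model)
print(top)
```

Working in log space turns the product over words into a sum, which corresponds to the final line of the equation: the prior gives the bias term b and the per-word log-probabilities give the weight vector w_k.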

When a seller wants to enroll an item in the Domeggook online catalog entry DB, he or she inputs the item name in natural language (in particular, Korean). We therefore use both KoNLPy/Twitter and the Naïve Bayes classifier to suggest appropriate catalog entries through the supervised machine learning process shown in Figure 8.

(Figure 8) Catalog Entry Auto-Recommendation Process using both KoNLPy/Twitter and Naïve Bayes Classifier with Supervised Machine Learning Process

5. Proposed Catalog Entry Auto Recommendation System

5.1 Structure of Catalog Entry Classification

Based on the above, we propose the process applied in the automatic product category classification and recommendation system developed for a wholesale online B2B shopping mall.

(Figure 9) Proposed Catalog Entry Auto-Recommendation System for Wholesale Online B2B Shopping Mall

First, the product catalog entry classification system uses as learning data the product names of each catalog entry, which are retrieved through the connection between the server on which the classification system operates and the DB server. The user enters the product name (text information) while entering and registering the product he or she intends to sell in the online shopping mall. The learned information is then stored in the DB server as a Predictive Model consisting of categories, words (nouns, English), numbers of occurrences, and scores.

Accurate and efficient learning requires a continuous learning process over as much data as possible. An efficient catalog entry recommendation system for a large-capacity online shopping mall can therefore be constructed by updating the contents stored in the DB server through a periodic learning process rather than through real-time processing. Accordingly, we implemented the periodic learning method shown in Figure 10.

(Table 1) Predictive DB for Enrolled Items

(Figure 10) Periodic Prediction Learning Structure
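
The storage layout and periodic update described above might look roughly like the following sketch, which uses an in-memory SQLite database in place of the DB server. The table name `predictive_model`, its column names, and the use of relative word frequency as the score are assumptions for illustration; the text lists only the stored fields (category, word, number of occurrences, score), not how the score is computed.

```python
import sqlite3
from collections import Counter

# In-memory SQLite stands in for the DB server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictive_model ("
    " category TEXT, word TEXT, occurrences INTEGER, score REAL,"
    " PRIMARY KEY (category, word))"
)


def rebuild_model(conn, samples):
    """Recount words per category and replace the stored model in one batch.

    Intended to be called periodically (for example by a scheduler),
    not once per registration request.
    """
    counts = Counter()
    totals = Counter()
    for category, words in samples:
        for word in words:
            counts[(category, word)] += 1
            totals[category] += 1
    rows = [
        # score = relative frequency of the word within its category (assumed)
        (category, word, n, n / totals[category])
        for (category, word), n in counts.items()
    ]
    with conn:  # one transaction: readers see the old model or the new one
        conn.execute("DELETE FROM predictive_model")
        conn.executemany(
            "INSERT INTO predictive_model VALUES (?, ?, ?, ?)", rows
        )
    return len(rows)


# Toy training data (hypothetical).
samples = [
    ("computer desk", ["computer", "desk"]),
    ("computer desk", ["folding", "desk"]),
    ("backpack", ["travel", "backpack"]),
]
n_rows = rebuild_model(conn, samples)
desk = conn.execute(
    "SELECT occurrences, score FROM predictive_model"
    " WHERE category = ? AND word = ?",
    ("computer desk", "desk"),
).fetchone()
print(n_rows, desk)
```

Rebuilding the table in a single batch keeps the cost of learning out of the registration path, which is the motivation the text gives for periodic rather than real-time updates.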

When the registered products are used for catalog entry prediction and recommendation, the system minimizes the overall load by querying the DB server only for the data needed for the decision. In addition, during text preprocessing, a filtering function is applied before learning to remove meaningless words in advance, which improved the accuracy of catalog entry recommendation.

(Figure 11) Catalog Entry Classification System Structure
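
The idea of querying the DB server only for the data needed for a decision can be sketched as follows, again with an in-memory SQLite database standing in for the DB server. The table layout, the sample rows, and the helper name `fetch_scores` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictive_model (category TEXT, word TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO predictive_model VALUES (?, ?, ?)",
    [
        ("computer desk", "computer", 0.4),
        ("computer desk", "desk", 0.5),
        ("sitting desk", "desk", 0.6),
        ("backpack", "travel", 0.3),
    ],
)


def fetch_scores(conn, words):
    """Load only the model rows whose word occurs in the input item name."""
    unique = sorted(set(words))
    if not unique:
        return []
    # One "?" placeholder per word keeps the query parameterized.
    placeholders = ", ".join("?" for _ in unique)
    sql = (
        "SELECT category, word, score FROM predictive_model "
        f"WHERE word IN ({placeholders}) ORDER BY category, word"
    )
    return conn.execute(sql, unique).fetchall()


rows = fetch_scores(conn, ["desk", "computer", "sale"])
print(rows)
```

Only the rows for "computer" and "desk" are returned; the "backpack" row is never read, and the unknown word "sale" simply matches nothing.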

5.2 Experimental Data

We used a data set of 6,394 items previously stored on the B2B shopping mall site of GNG Commerce Co., Ltd. The data below is part of the list of products currently registered on the wholesale site, and our supervised machine learning system used this data set to classify input items and recommend catalog entries automatically.

(Table 2) Example of Catalog Entry Classification Learning Data

5.3 Local Test

In the local test, we ran the supervised learning process for catalog entries in a CLI environment through the console interface. Figure 12 shows the data learning screen. The learning process reads the Training Data Set from the DB and stores the learned Predictive Model back in the DB.

(Figure 12) Machine Learning Process 

When a new item is entered, the Predictive Model produces the recommended categories shown in Figure 13, and it also provides the filtering result for a search keyword.

(Figure 13) Prediction, Recommendation, and Filtering Process after Machine Learning Step

In this paper, we propose a preprocessing step for the data because the machine learning model is based on big data registered manually by users. Users can enter data as whatever words and phrases they want. Because these entries include vendors, discount information, and so on in addition to descriptions of the products, a filtering function can be used in the registration process. Here, we show it as it appears in the CLI environment; the original file is a *.txt file in which words are separated by commas (,) as in CSV.
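
A minimal sketch of such a filtering step is given below. It assumes that the comma-separated text file holds the words to be removed (vendor names, discount terms, and so on), which is one reading of the description above; the example filter words and the helper names are hypothetical, and `io.StringIO` stands in for the *.txt file.

```python
import io


def load_filter_words(text_file):
    """Read comma-separated filter words from an open text file object."""
    words = set()
    for line in text_file:
        words.update(w.strip() for w in line.split(",") if w.strip())
    return words


def apply_filter(tokens, filter_words):
    """Drop every token that appears in the filter word set."""
    return [t for t in tokens if t not in filter_words]


# io.StringIO stands in for the *.txt file; the words are made-up examples
# of promotional terms (free shipping, special price, discount, ...).
filter_file = io.StringIO("무료배송, 특가, 할인\n도매, 최저가\n")
filter_words = load_filter_words(filter_file)

tokens = ["밀리터리", "백팩", "특가", "무료배송", "등산가방"]
cleaned = apply_filter(tokens, filter_words)
print(cleaned)
```

Keeping the filter words in a plain text file means the list can be extended without touching the learning code.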

(Figure 14) Catalog Entry Accuracy Measurement

The accuracy of catalog entry classification and recommendation was measured as follows. First, the catalog entry classification accuracy is about 37% when a result counts as correct only if it exactly matches the catalog entry of the stored data. This is because the catalog entry classification contains similar categories and many items belong to overlapping catalog entries. As a result, we confirmed that catalog entries are recommended with higher accuracy than with the passive method. In addition, the system recommends the top three most relevant categories rather than only the first-ranked catalog entry, so that users can make the final selection of an appropriate catalog entry.
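
The difference between the exact-match criterion and a top-three criterion can be made concrete with a short sketch. The function name and the four evaluation records below are invented for illustration and do not reproduce the measurements reported above.

```python
def top_k_accuracy(results, k):
    """results: list of (true_category, ranked_recommendations).

    A record counts as a hit when the true category appears among
    the first k recommendations.
    """
    if not results:
        return 0.0
    hits = sum(1 for truth, ranked in results if truth in ranked[:k])
    return hits / len(results)


# Made-up evaluation records (not the data measured in this paper).
results = [
    ("computer desk", ["computer desk", "sitting desk", "other"]),
    ("backpack", ["cross bag", "backpack", "travel case"]),
    ("toy sword", ["toy gun", "knife", "tool play"]),
    ("sitting desk", ["computer desk", "other", "sitting desk"]),
]

exact = top_k_accuracy(results, 1)  # only the first recommendation counts
top3 = top_k_accuracy(results, 3)   # any of the top three counts
print(exact, top3)
```

With k = 1 this corresponds to the strict exact-match criterion; with k = 3 it reflects the situation in which the user picks the final catalog entry from three recommendations.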

5.4 Field Test

We also conducted a direct test to check whether the catalog entry classification and recommendation functions worked well with the database of GNG Commerce Co., Ltd. The test results showed that the system automatically recommends the top three categories most relevant to the input keywords, as in our local tests.

For example, when "folding seat computer desk (Korean: 접이식 좌석 컴퓨터 책상)" is entered, the web execution result in Figure 15 shows that the three categories "computer desk (컴퓨터 책상)", "sitting desk (좌식 책상)" and "other (기타)" are recommended automatically.

(Figure 15) Catalog Entry Recommendation Results for Keyword "folding seat computer desk"

In addition, we confirmed that "Backpack (백팩)", "Cross Bag (크로스백)" and "Travel Accessory Case (여행소품케이스)" were automatically recommended as categories when we requested registration of "Military Backpack Military Bag Traveling Bag Climbing Bag (Korean: 밀리터리 백팩 군인 가방 여행용가방 등산가방)". The accuracy of this recommendation was also acceptably high.

(Figure 16) Catalog Entry Recommendation Results for Keyword "Military Backpack Military Bag Traveling Bag Climbing Bag"

Finally, even when the input words are separated by '/', as in "Golden long sword / Golden knife / Toy knife / Long knife / Ninja knife / General knife (Korean: 황금장검 / 황금칼 / 장난감칼 / 긴칼 / 닌자칼 / 장군칼)", three categories, "Toy gun (장난감총)", "Knife / cutting board / cutting tool (칼 / 도마 / 커팅기구)" and "tool play (공구놀이)", were automatically recommended. These recommendations were somewhat less accurate, but we could confirm that catalog entries related to "sword (검)" and "knife (칼)" and a catalog entry containing the word "toy (장난감)" were automatically recommended.

(Figure 17) Catalog Entry Recommendation Results for Keyword "Golden long sword / Golden knife / Toy knife / Long knife / Ninja knife / General knife"

As the examples above show, catalog entries are recommended automatically and fairly accurately according to the input keywords. The experimental results show that the knowledge-based machine learning method can improve the classification accuracy of items and the relevance of the recommended categories compared with the manual, passive catalog entry registration method. As a result, sales of B2B online shopping malls can be expected to improve.

6. Conclusion

We developed a catalog entry classification and automatic recommendation algorithm for registered products, based on supervised machine learning over purchase information, using a morphological analyzer (KoNLPy) to extract meaningful information from the product name (text information) entered by users of a large-scale online B2B shopping mall. As a result, products could be registered more accurately than when shop users registered them manually. To provide a more accurate catalog entry classification and automatic recommendation function, it is necessary to improve the quality of the learning data by eliminating redundant words in the filtering process and to correct errors in the registered catalog entries.

References

  1. GNG Commerce B2B Online Shopping Mall Web site : http://domeggook.com
  2. Willi Richert, Luis Pedro Coelho, "Building Machine Learning Systems with Python," Packt, 2013.
  3. Eunjeong L. Park, Sungzoon Cho, "KoNLPy: Korean natural language processing in Python", 26th Hangul and Korean Information Processing Proceeding, 2014.
  4. Naive Bayes classifier : https://en.wikipedia.org/wiki/Naive_Bayes_classifier
  5. Ki-Duk Kwon, "A Naive Bayesian-based Model of the Opponent's Policy for Efficient Multiagent Reinforcement Learning," Journal of Internet Computing and Services, vol. 9, no. 6, pp. 165-178, 2008.
  6. Scikit-Learn : http://scikit-learn.org
  7. Jun Lee, and YongJin KWON, "A Proposal of Methods for Extracting Temporal Information of History-related Web Document based on Historical Objects Using Machine Learning Techniques," Journal of Internet Computing and Services, vol. 16, no. 4, pp. 39-50, 2015. https://doi.org/10.7472/jksii.2015.16.4.39
  8. Im San-Song, Na Cheol-Hun, and Jung Hoe-Kyung, "Design and implementation of integrated e-catalog system based on web services," Journal of Internet Computing and Services, vol. 6, no. 2, pp. 153-0, 2005.
  9. Young-Won Seo, and Chang-soo Kim, "Clustering Algorithm using the DFP-Tree based on the MapReduce," Journal of Internet Computing and Services, vol. 16, no. 6, pp. 23-30, 2015. https://doi.org/10.7472/jksii.2015.16.6.23
  10. Seung-Su Chun, "A study on integrating and discovery of semantic based knowledge model," Journal of Internet Computing and Services, vol. 15, no. 6, pp. 99-106, 2014. https://doi.org/10.7472/jksii.2014.15.6.99
  11. V. Jothi Prakash, Dr. L.M. Nithya, "A Survey On Semi-Supervised Learning Techniques," International Journal of Computer Trends and Technology (IJCTT), Vol.8, No.1, Feb 2014. https://doi.org/10.14445/22312803/IJCTT-V8P105
  12. Alejandro Cholaquidis, Ricardo Fraiman and Mariela Sued, "Semi-supervised learning: When and why it works," May, 2018, https://arxiv.org/pdf/1805.09180 [DOI: arXiv:1805.09180].
  13. Semi-supervised learning : https://en.wikipedia.org/wiki/Semi-supervised_learning
  14. Huimin Pei, Kuaini Wang, Qiang Lin, Ping Zhong, "Robust semi-supervised extreme learning machine," Knowledge-Based Systems, Vol. 159, No. 1, pp.203-220, Nov, 2018. https://doi.org/10.1016/j.knosys.2018.06.029