황명하 (Myeong-Ha Hwang)†, 이인태 (In-Tae Lee), 채창훈 (Chang-Hun Chae), 정남준 (Nam-Joon Jung)
(Digital Solution Laboratory, Korea Electric Power Research Institute (KEPRI), Korea)
Copyright © The Korean Institute of Electrical Engineers(KIEE)
Key words
Deep Learning, Natural Language Processing, Power Generation, Diagnostic Service, Text Mining, Framework
1. Introduction
Natural language processing (NLP) is emerging as one of the most frequently used technologies across a wide range of artificial intelligence applications. With the recent advancement of NLP technology and the continuous growth in the value and volume of data, programs capable of understanding natural language, such as human speech and writing, are already widely used for clinical documents, airline reservations, and vehicle roadside assistance.
The global market for software, hardware, and services in the NLP sector is forecast to expand accordingly. Tractica forecast that the NLP market, estimated at $277.2 million in 2015, would grow by an average of 25% annually to reach $2.1 billion by 2024. Owing to the high demand in the NLP sector, its market size has continued to increase (1).
In particular, deep learning technology related to word embedding, which represents the similarity between words as vectors, has been attracting attention in the NLP sector. However, there have been no cases of applying this technology to the electric power industry, and no corresponding service frameworks exist. Moreover, although thousands of knowledge documents produced by electric generator operation experts have been collected by Korea Electric Power Corporation (KEPCO) over about 20 years, they have rarely been applied to electric generator operation.
Therefore, in this report, we propose Gen2Vec, an NLP framework for electric generator
operation knowledge services using deep learning technology. Gen2Vec is a framework
that includes a preprocessing function that extracts nouns from search sentences or
words, a word recommendation function that recommends words related to search words
using deep learning-based word embedding, and a document recommendation function that
recommends documents related to search words. In the future, Gen2Vec can be the core
engine of the knowledge system for electric generator operation experts and will be
used in search and chatbot services for electric generator operation programs and
new employee education.
2. Related Work
2.1 Word Embedding based on Deep Learning
In its original mathematical sense, an embedding is a drawing of a graph on a surface such that its edges do not cross (2). Recently, embedding techniques have been widely employed in the NLP sector. Typical embedding techniques include word embedding, in which words are represented as vectors, and the term frequency (TF)–inverse document frequency (IDF) approach, in which the importance of each word in a document is quantified.
First, for word embedding, there are prediction-based models, including neural probabilistic language models such as Word2Vec and FastText, and matrix factorization-based models such as latent semantic analysis and the global word vector (GloVe) model (3-8). In this study, we selected Word2Vec as the word embedding technique because it has shown good performance in evaluating word similarity and is widely used in various industries.
As shown in Fig. 1, Word2Vec has two training methods: the continuous bag-of-words (CBOW) method and the Skip-gram method. CBOW is a training method that predicts a target word from its surrounding words (Fig. 1(a)). On the other hand, Skip-gram is a training method that predicts the surrounding words from a target word (Fig. 1(b)).
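As a minimal sketch of the two training modes (not the paper's implementation; the toy corpus and parameters below are illustrative), the Gensim library used later in Section 3.3 switches between CBOW and Skip-gram with the sg parameter:

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized maintenance phrases (illustrative only).
sentences = [
    ["gas", "turbine", "blade", "crack"],
    ["compressor", "blade", "damage"],
    ["gas", "turbine", "compressor", "vane"],
]

# sg=0 trains CBOW (predict the target word from its context);
# sg=1 trains Skip-gram (predict the context from the target word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv.most_similar("turbine", topn=3))
```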
Second, TF-IDF, a weight used in information retrieval and text mining, is a statistical measure indicating how important a word is in a particular document (9). The TF indicates how often a specific word appears in a document, and the IDF is the inverse of the document frequency. The TF-IDF is obtained by multiplying the TF by the IDF. It is used to rank search results in search engines and to measure the similarity between documents within a document cluster.
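As a small illustration of this definition (a sketch; the toy corpus and the unsmoothed logarithm are assumptions), TF-IDF can be computed directly:

```python
import math

# Toy tokenized documents (illustrative only).
docs = [
    ["gas", "turbine", "blade", "crack"],
    ["compressor", "blade", "damage"],
    ["boiler", "tube", "leak"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)            # term frequency within one document
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    idf = math.log(len(corpus) / df)           # inverse document frequency
    return tf * idf

print(tf_idf("crack", docs[0], docs))  # rare term -> higher weight
print(tf_idf("blade", docs[0], docs))  # appears in 2 of 3 docs -> lower weight
```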
Minarro-Gimenez et al. studied improving the accessibility of medical knowledge by applying Word2Vec to medical documents, and Husain and Dih developed a TF-IDF-based mobile application recommendation system for travelers (10-11). Similarly, research using deep learning-based word embedding and TF-IDF has been actively conducted in various industries. However, research applying these technologies in the electric power sector remains insufficient.
그림. 1. CBOW와 Skip-gram 구조
Fig. 1. CBOW and Skip-gram Structure
2.2 Framework
A framework is a software environment that provides reusable designs and implementations of specific software functions in a collaborative form to make the development of a software platform effective (12). A framework can be maintained through systematic code management, is highly reusable, and offers high development productivity through the function libraries it provides.
Regarding research on frameworks, Bedi and Toshniwal developed a deep learning framework for forecasting electricity demand using long short-term memory (13). Dowling et al. proposed an optimization framework for evaluating the revenue opportunities provided by multi-scale hierarchies in the electric power market and for determining optimal participation strategies for individual participants (14). In addition, Pinheiro and Davis Jr. improved user convenience by developing ThemeRise, a framework for building volunteered geographic information applications, a type of crowdsourcing, that manages the characteristics and structure of data collection target themes, and Jack Jr. et al. proposed the National Institute on Aging and Alzheimer's Association (NIA-AA) research framework to support research on the biological definition of Alzheimer's disease (15-16). As these studies show, framework research has been gaining attention not only in the electric power industry but also in other industries. Introducing a framework allows users to conveniently utilize existing technologies and enables efficient platform development.
3. Proposed Gen2Vec Framework
3.1 Framework Architecture
The framework architecture of the proposed Gen2Vec is shown in Fig. 2. Pretraining is performed with deep learning-based Word2Vec using 1,348 expert knowledge documents on electric generators. When the user enters a sentence or word in the search box, the preprocessing function extracts only the nouns. Then, the word recommendation function is performed through embedding based on the extracted and pretrained words. Next, $Gen2Vec_{weight}$ and $Gen2Vec_{score}$ are calculated using the words extracted by the word recommendation function and the TF-IDF value of each document. Lastly, the document recommendation function recommends documents related to the search word.
그림. 2. Gen2Vec 프레임워크 구조
Fig. 2. Framework Architecture of Gen2Vec
3.2 Preprocessing Function
The preprocessing function of Gen2Vec is performed after the user enters a word or
sentence that he/she wants to search in the search box. Firstly, the word or sentence
is tokenized to separate each word into tokens, and then part of speech tagging is
used to add the part of speech to each token. After this, the word recommendation
function is performed by extracting only the noun tokens. KoNLPy that korean natural
language processing package for python was used for the preprocessing function of
Gen2Vec (17).
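A minimal sketch of this preprocessing step with KoNLPy follows; the Okt tagger and the example query are assumptions, as the paper does not specify which KoNLPy tagger was used:

```python
from konlpy.tag import Okt

tagger = Okt()
query = "가스터빈 압축기 블레이드에 균열 발생"  # "crack occurrence in a gas turbine compressor blade"

tokens = tagger.pos(query)   # tokenization plus part-of-speech tagging
nouns = tagger.nouns(query)  # keep only the noun tokens

print(tokens)  # e.g., [('가스터빈', 'Noun'), ('압축기', 'Noun'), ...]
print(nouns)   # e.g., ['가스터빈', '압축기', '블레이드', '균열', '발생']
```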
3.3 Word Recommendation Function
The word recommendation function was developed using the Gensim framework-based Word2Vec
for the words extracted from the preprocessing function (18). The training parameters of the Skip-gram model for extracting embedding words are
matrices u and v. The size of each matrix is the size of the vocabulary set (|v|)
by the number of embedding dimensions (d). The probability that the target word (t)
and content word (c) are positive samples is calculated using Eq. (1), and the probability that t and c are negative samples is calculated using Eq. (2).
Eq. (1) Positive sample calculation of Skip-gram:

$$P(+\mid t,c)=\frac{1}{1+\exp(-u_{t}\cdot v_{c})}$$
Eq. (2) Negative sample calculation of Skip-gram:

$$P(-\mid t,c)=1-P(+\mid t,c)=\frac{\exp(-u_{t}\cdot v_{c})}{1+\exp(-u_{t}\cdot v_{c})}$$
The log-likelihood function of Skip-gram is given in Eq. (3). After training to optimize Eq. (3), the word recommendation function can vectorize the words in a document cluster, and the resulting vectors are used to recommend words related to the search word.
Eq. (3) Log-likelihood function of Skip-gram:

$$\mathcal{L}=\sum_{(t,c)\in D^{+}}\log P(+\mid t,c)+\sum_{(t,c)\in D^{-}}\log P(-\mid t,c)$$

where $D^{+}$ and $D^{-}$ denote the sets of positive and negative samples.
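As a worked illustration of Eqs. (1)-(3) (a sketch with randomly initialized toy matrices; the sample pairs below are arbitrary), the probabilities and the log-likelihood can be evaluated directly:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                  # toy vocabulary size |V| and embedding dimension d
u = rng.normal(size=(V, d))  # target-word matrix u
v = rng.normal(size=(V, d))  # context-word matrix v

def p_positive(t, c):  # Eq. (1)
    return 1.0 / (1.0 + np.exp(-u[t] @ v[c]))

def p_negative(t, c):  # Eq. (2)
    return 1.0 - p_positive(t, c)

# Eq. (3): sum of log-probabilities over positive and negative sample pairs.
positives = [(0, 1), (0, 2)]
negatives = [(0, 3), (0, 4)]
log_likelihood = (sum(np.log(p_positive(t, c)) for t, c in positives)
                  + sum(np.log(p_negative(t, c)) for t, c in negatives))
print(log_likelihood)
```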
3.4 Document Recommendation Function
The document recommendation function of the proposed Gen2Vec is as follows. When the
user enters a necessary word or sentence in the search box, the preprocessing function
is used to extract the noun words ($Word_{x_{i}}$). The extracted words are input
into the trained Word2Vec model to extract the TopN words ($Word_{y_{i}}$) with high
cosine similarity for each $Word_{x_{i}}$. The formula for obtaining the cosine similarity
is expressed in Eq. (4), and the method for obtaining $Word_{y_{i}}$ is expressed in Eq. (5).
Eq. (4) The formula for obtaining the cosine similarity:

$$\cos(\vec{a},\vec{b})=\frac{\vec{a}\cdot\vec{b}}{\|\vec{a}\|\,\|\vec{b}\|}$$
Eq. (5) The method for obtaining $Word_{y_{i}}$:

$$Word_{y_{i}}=\underset{w\in V\setminus\{Word_{x_{i}}\}}{\operatorname{TopN}}\,\cos\left(\vec{v}_{Word_{x_{i}}},\vec{v}_{w}\right)$$
After defining each word (w), each target word (t), each document (d), the total number of documents (D), and the frequency function (f(·)) over the documents, the TF-IDF is computed using Eq. (6). Then, a data frame is extracted for the TF-IDF, with the word list consisting of $Word_{x_{i}}$ and $Word_{y_{i}}$ as the column values and the documents as the row values. Table 1 shows an example of the extraction results.
Eq. (6) The formula of TF-IDF:

$$TF\text{-}IDF(t,d,D)=f(t,d)\times\log\frac{D}{\left|\{d:t\in d\}\right|}$$
표 1. 문서 추천 기능을 위한 데이터 프레임 예시
Table 1. Data Frame Example for Document Recommendation Function
|      | $Word_{x_{1}}$ | .. | $Word_{x_{n}}$ | $Word_{y_{1}}$ | .. | $Word_{y_{m}}$ |
| Doc1 | 0.27 | .. | 0.32 | 0.41 | .. | 0.19 |
| ..   | ..   | .. | ..   | ..   | .. | ..   |
| DocZ | 0.13 | .. | 0.11 | 0.21 | .. | 0.02 |
The above data frame contains the words $Word_{x_{i}}$ that the user directly enters in the search box and the related words $Word_{y_{i}}$ extracted by deep learning. Because it is necessary to give different weights to $Word_{x_{i}}$ and $Word_{y_{i}}$, $Gen2Vec_{weight}$ is defined as expressed in Eq. (7), and the TF-IDF values in the data frame are updated with it.
Eq. (7) The formula of $Gen2Vec_{weight}$, which scales the TF-IDF of a word according to whether it was entered by the user or recommended by embedding:

$$Gen2Vec_{weight}(w,d)=\begin{cases}\alpha\times TF\text{-}IDF(w,d), & w\in\{Word_{x_{i}}\}\\ \beta\times TF\text{-}IDF(w,d), & w\in\{Word_{y_{i}}\}\end{cases},\qquad \alpha>\beta$$
Next, the $Gen2Vec_{weight}$ values for each document in the data frame are summed, and the TopK documents are extracted. The function used for this calculation is defined as $Gen2Vec_{score}$. If T is defined as the total number of $Word_{x_{i}}$ and $Word_{y_{i}}$, $Gen2Vec_{score}$ is obtained as shown in Eq. (8).
Eq. (8) The formula of $Gen2Vec_{score}$:

$$Gen2Vec_{score}(d)=\sum_{i=1}^{T}Gen2Vec_{weight}(w_{i},d)$$
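A minimal sketch of this scoring step follows; the weight values α and β and the toy TF-IDF data frame are assumptions, as the paper does not report the exact weights used:

```python
# Hypothetical weights: user-entered words (Word_x) count more than recommended words (Word_y).
ALPHA, BETA = 1.0, 0.5

# Toy TF-IDF data frame: {document: {word: tf_idf}} (values are illustrative).
tfidf = {
    "Doc1": {"가스터빈": 0.27, "블레이드": 0.32, "시운전": 0.41},
    "Doc2": {"가스터빈": 0.13, "블레이드": 0.11, "시운전": 0.21},
}
word_x = {"가스터빈", "블레이드"}  # nouns extracted from the user's query
word_y = {"시운전"}                # related words recommended by Word2Vec

def gen2vec_score(doc_words):
    # Eq. (7): reweight each word's TF-IDF by its class; Eq. (8): sum over all T words.
    return sum((ALPHA if w in word_x else BETA) * s
               for w, s in doc_words.items() if w in word_x | word_y)

# Rank the documents by Gen2Vec_score and keep the TopK.
topk = sorted(tfidf, key=lambda doc: gen2vec_score(tfidf[doc]), reverse=True)[:10]
print([(doc, round(gen2vec_score(tfidf[doc]), 2)) for doc in topk])
```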
4. Experiments and Results
4.1 Expert Documents for Diagnostic Services of Power Generation Facility
KEPCO has operated electric generators at each of its power plants for about 20 years, and experts have directly diagnosed the generators, accumulating 1,348 documents. The experts used the categories of boiler, electric generator, performance, gas turbine, and steam turbine in the diagnoses and designated subcategories such as fault diagnosis and precision diagnosis. Table 2 shows the statistics of the expert documents on electric generator operation collected from 2000 to 2018. The Gen2Vec developed in this study was trained on these documents.
표 2. 발전설비 진단을 위한 전문가 문서 군집
Table 2. Expert Documents for Diagnostic Services of Power Generation Facility
| Category | Subcategory | Number of documents |
| Boiler | Fault diagnosis | 52 |
| Boiler | Precision diagnosis | 380 |
| Gas turbine | Fault diagnosis | 34 |
| Gas turbine | Precision diagnosis | 53 |
| Steam turbine | Fault diagnosis | 33 |
| Steam turbine | Precision diagnosis | 57 |
| Electric generator | Leak absorption | 122 |
| Electric generator | Prevention diagnosis | 37 |
| Performance | Insulation diagnosis | 546 |
| Performance | Precision diagnosis | 34 |
| Total | | 1,348 |
표 3. 단어 추천 기능의 결과 예시
Table 3. Result Example of Word Recommendation Function
| $Word_{x_{1}}$~$Word_{x_{5}}$ [Korean] (Cosine Similarity) | Gas turbine [가스터빈] | Compressor [압축기] | Blade [블레이드] | Crack [균열] | Occurrence [발생] |
| $Word_{y_{1}}$~$Word_{y_{5}}$ | Trial run [시운전] (0.88) | Blade [블레이드] (0.97) | Compressor [압축기] (0.97) | Tiny [미세] (0.84) | Majority [다수] (0.86) |
| $Word_{y_{6}}$~$Word_{y_{10}}$ | Gunsan [군산] (0.85) | Bucket [버켓] (0.91) | Bucket [버켓] (0.93) | Progress [진전] (0.83) | Similarity [유사] (0.85) |
| $Word_{y_{11}}$~$Word_{y_{15}}$ | Low pressure [저압] (0.83) | Past [과거] (0.88) | Vane [베인] (0.91) | Fault [결함] (0.83) | Order [차례] (0.85) |
| $Word_{y_{16}}$~$Word_{y_{20}}$ | Turbine [터빈] (0.81) | Rotor [로터] (0.85) | Recommendation [권고] (0.88) | Expansion [확대] (0.81) | Estimation [추정] (0.85) |
| $Word_{y_{21}}$~$Word_{y_{25}}$ | Component [부품] (0.80) | Type [종류] (0.84) | Rotor [로터] (0.88) | Discovery [발견] (0.81) | Many [여러] (0.84) |
그림. 3. 전처리 기능 예시
Fig. 3. An Example of Preprocessing Function
4.2 Preprocessing and Word Recommendation
The experimental example and results of the preprocessing function are shown in Fig. 3. After entering a word or sentence that the user wants to search in the search box,
tokenization is applied to separate the word or sentence into tokens when the preprocessing
function is applied (Fig. 3 (a), (b)). Next, the part of speech is tagged to each token, and the nouns are extracted (Fig. 3(c), (d)). The extracted nouns are input into the word recommendation function.
The experimental example and results of the word recommendation function are shown in Table 3. In this experiment, the Word2Vec model was pretrained on the document corpus, and the nouns ($Word_{x_{i}}$) extracted by the preprocessing function, as shown in Fig. 3, were input into it. For pretraining, the word vector dimension was set to 1,000, the window size to 4, and the downsampling threshold for frequently appearing words to 1e-3. With these parameters, the embedding words ($Word_{y_{i}}$) corresponding to the TopN were extracted and output in descending order of cosine similarity. In this experiment, N was set to 5, and each extracted $Word_{y_{i}}$ carries its cosine similarity value.
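A sketch of this pretraining configuration in Gensim is given below; the toy corpus stands in for the 1,348 expert documents, and the Skip-gram setting (sg=1, following Section 3.3) and min_count are assumptions, while the dimension, window, and downsampling values are those stated above:

```python
from gensim.models import Word2Vec

# Toy stand-in for the tokenized expert documents (illustrative only).
corpus = [
    ["가스터빈", "압축기", "블레이드", "균열", "발생"],
    ["압축기", "블레이드", "버켓", "손상"],
]

model = Word2Vec(
    corpus,
    vector_size=1000,  # word vector dimension stated above
    window=4,          # window size stated above
    sample=1e-3,       # downsampling threshold stated above
    sg=1,              # Skip-gram, per Section 3.3 (assumption)
    min_count=1,       # needed for the toy corpus (assumption)
)
print(model.wv.most_similar("가스터빈", topn=5))  # TopN with N = 5
```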
The experimental results show that words highly related to each $Word_{x_{i}}$ were extracted as $Word_{y_{i}}$. The extracted words included duplicates; except for the duplicate with the highest cosine similarity, duplicate words were excluded from $Word_{y_{i}}$ before being used in the document recommendation function.
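A small sketch of this deduplication rule (keep each duplicated word once, at its highest cosine similarity) might look as follows; the input pairs are taken from Table 3 for illustration:

```python
# (word, cosine similarity) pairs recommended for several search words;
# "버켓" (bucket) appears twice with different similarities (see Table 3).
candidates = [("버켓", 0.91), ("버켓", 0.93), ("베인", 0.91), ("로터", 0.85)]

best = {}
for word, sim in candidates:
    if sim > best.get(word, 0.0):  # keep only the highest-similarity occurrence
        best[word] = sim
print(best)  # {'버켓': 0.93, '베인': 0.91, '로터': 0.85}
```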
4.3 Document Recommendation Function
The experimental example and results of the document recommendation function are presented
in Table 4. The TF-IDF value that was pretrained for the nouns ($Word_{x_{i}}$) extracted by
the preprocessing function and words recommended by the word recommendation function
($Word_{y_{i}}$) was updated with the proposed Gen2Vecweight. Next, the documents
extracted using Gen2Vecscore were presented to the user. Defining the K value used
to obtain TopK as 10, in relation to the search words in the example employed in this
study, the extracted documents are listed in Table 4.
표 4. 문서 추천 기능의 결과 예시
Table 4. Result Example of Document Recommendation Function
| Rank | Document name | $Gen2Vec_{score}$ |
| 1 | Yeongwol natural gas power plant gas turbine report (1) | 2.44 |
| 2 | Seo-incheon 5GT OH technical support report (1) | 2.39 |
| 3 | Yeongwol natural gas power plant gas turbine report (2) | 2.35 |
| 4 | Busan 3GT report | 2.18 |
| 5 | Seo-incheon unit 1 gas turbine maintenance work technical support report | 2.17 |
| 6 | Bundang 8GT 1 blade damage report | 2.16 |
| 7 | Pyeongtaek 3GT composite report (1) | 2.10 |
| 8 | Pyeongtaek 3GT composite report (2) | 2.08 |
| 9 | Seo-incheon 5GT OH technical support report (2) | 2.07 |
| 10 | Yeongwol 2GT high temperature parts damage report | 1.98 |
The results in Table 4 confirm that documents related to the nouns extracted from the search words, such as gas turbine reports, blade damage reports, and high-temperature parts damage reports, were retrieved. The performance of the document recommendation function is evaluated in Table 5, where precision, recall, and F1 results were derived for Gen2Vec, Word2Vec, and TF-IDF. The F1 score of Gen2Vec was about 3.9 and 10.8 percentage points higher than those of Word2Vec and TF-IDF, respectively.
표 5. 문서 추천 기능의 성능평가
Table 5. Evaluated Performance of Document Recommendation Function
| Algorithm | Precision (%) | Recall (%) | F1 (%) |
| Gen2Vec | 81.3 | 84.9 | 83.1 |
| Word2Vec | 78.2 | 80.1 | 79.2 |
| TF-IDF | 71.9 | 72.7 | 72.3 |
5. Conclusion
In this report, we proposed Gen2Vec, a knowledge service framework for electric generator operation based on user search words. Gen2Vec offers three functions to provide efficient knowledge services. First, a preprocessing function separates a sentence entered by the user into tokens and extracts only the nouns. Second, a word recommendation function recommends words related to the search words by applying a model trained by deep learning. Last, a document recommendation function extracts highly related documents by applying $Gen2Vec_{weight}$ and $Gen2Vec_{score}$ to the TF-IDF values pretrained for the words extracted by the preprocessing and word recommendation functions.
With Gen2Vec, experts and new employees who operate electric generators can quickly retrieve the expert documents accumulated over 20 years by KEPCO when diagnosing generators. Consequently, operators and new employees can obtain expert knowledge even when no experts are present in the power plants and can easily apply this knowledge in the field.
In the future, we plan to extend the word and document recommendation functions of Gen2Vec into a personalized recommendation function by developing optimization functions tailored to the search words of individuals. Furthermore, we are extending Gen2Vec with training in multiple languages such as English and Chinese, and improving user-friendliness by developing a chatbot service and a voice-based search service that use Gen2Vec as the core engine of knowledge services for electric generator operation.
Acknowledgements
This work was funded by the Korea Electric Power Corporation (KEPCO).
References
R. Madhavan, 2018, Natural language processing current applications and future possibilities, Tractica Omdia
I. Abraham, Y. Bartal, O. Neiman, Sep 2011, Advances in metric embedding theory, Advances in Mathematics, Vol. 228, pp. 3026-3126
Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin, Feb 2003, A neural probabilistic language model, Journal of Machine Learning Research, Vol. 3, pp. 1137-1155
T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Dec 2013, Distributed representations
of words and phrases and their compositionality, Proceedings of the 26th International
Conference on Neural Information Processing Systems (NIPS), Australia, Vol. 2, pp.
3111-3119
T. Mikolov, K. Chen, G. Corrado, J. Dean, Jan 2013, Efficient estimation of word representations
in vector space, Proceedings of the International Conference on Learning Representations
(ICLR), USA
A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, April 2017, Bag of tricks for efficient text classification, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Spain, Vol. 2, pp. 427-431
S. T. Dumais, 2005, Latent semantic analysis, Annual Review of Information Science
and Technology, Vol. 38, pp. 188-230
J. Pennington, R. Socher, C. D. Manning, Oct 2014, Glove: Global vectors for word
representation, Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), Qatar, pp. 1532-1543
G. Salton, M. J. McGill, 1983, Introduction to Modern Information Retrieval, McGraw-Hill
J. A. Minarro-Gimenez, O. Marin-Alonso, M. Samwald, 2014, Exploring the application
of deep learning techniques on medical text corpora, 2014 European Federation for
Medical Informatics and IOS Press, pp. 584-588
W. Husain, L. Y. Dih, July 2012, A framework of a personalized location-based traveler
recommendation system in mobile application, International Journal of Multimedia and
Ubiquitous Engineering, Vol. 7, pp. 11-18
A. Gachet, Software frameworks for developing decision support systems – A new component
in the classification of DSS development tools, Journal of Decision Systems, Vol.
12, No. 3, pp. 271-281
J. Bedi, D. Toshniwal, Jan 2019, Deep learning framework to forecast electricity demand,
Applied Energy, Vol. 238, pp. 1312-1326
A. W. Dowling, R. Kumar, V. M. Zavala, Jan 2017, A multi-scale optimization framework for electricity market participation, Applied Energy, Vol. 190, pp. 147-164
M. B. Pinheiro, C. A. Davis Jr, Jun 2018, ThemeRise: A theme-oriented framework for volunteered geographic information applications, Journal of Open Geospatial Data, Software and Standards, Vol. 1, pp. 3-9
C. R. Jack Jr, D. A. Bennett, K. Blennow, M. C. Carrillo, B. Dunn, S. B. Haeberlein, D. M. Holtzman, W. Jagust, F. Jessen, J. Karlawish, E. Liu, J. L. Molinuevo, T. Montine, C. Phelps, K. P. Rankin, C. C. Rowe, P. Scheltens, E. Siemers, H. M. Snyder, R. Sperling, 2018, NIA-AA research framework: Toward a biological definition of Alzheimer's disease, Alzheimer's & Dementia, Vol. 14, pp. 535-562
E. L. Park, S. Cho, 2014, KoNLPy: Korean natural language processing in Python, Proceedings
of the 26th Annual Conference on Human and Cognitive Language Technology, pp. 133-136
R. Rehurek, P. Sojka, 2011, Gensim-Statistical semantics in Python, The 4th European
Meeting on Python in Science
About the Authors
He received his B.S. degree from the Department of Information and Communication Engineering, Chungnam National University (CNU), Korea, in 2015 and his M.E. degree in Information and Communication Network Technology from the University of Science and Technology (UST), Korea, in 2018. He currently works for the Korea Electric Power Research Institute (KEPRI). His current research interests include deep learning and natural language processing (NLP).
He is currently working as a principal researcher at the KEPCO Research Institute (KEPRI), Daejeon, Korea. He received his M.S. in computer science from Korea University.
He received his M.S. degree in Information and Mechanical Engineering from the Gwangju Institute of Science and Technology (GIST). His major is computer science in general, with a focus on augmented reality and computer vision.
He received his Ph.D. degree in computer engineering from Hanbat University. His research interests are AI, VR/AR, and drone applications.