Journal The Transactions of the Korean Institute of Electrical Engineers (대한전기학회)
Title Text Encoding and Language Identification via N-Gram Feature and UTF-8 Encoding Detection
Authors Chaehui Hong (홍채희) ; Hyunji Cho (조현지) ; Hoon Yoo (유훈)
DOI https://doi.org/10.5370/KIEE.2024.73.9.1574
Page pp.1574-1580
ISSN 1975-8359
Keywords Language Identification; Text Data Encoding; N-gram; Machine Learning; UTF-8; Code Page
Abstract This paper presents a method for automatically identifying the encoding and language of documents. Online, document encoding and language identification helps give users easy access to the information they need and improves the efficiency of data processing and analysis. The proposed method first determines whether a document is encoded in UTF-8 by analyzing its bit pattern. For a UTF-8 document, the language is identified by computing the percentage of each language in the document through Unicode range analysis. A document that is not UTF-8 is treated as a code page document, and its languages are identified by a machine learning technique that uses N-gram features. Experimental results indicate that the proposed method improves encoding and language identification performance compared with existing methods.
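The three steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`is_utf8`, `script_percentages`, `byte_ngrams`), the `RANGES` table, and the choice of byte bigrams are all assumptions made for illustration, and the abstract does not specify the classifier that consumes the N-gram features.

```python
from collections import Counter

def is_utf8(data: bytes) -> bool:
    """Bit-pattern check: each lead byte must be followed by the right
    number of 10xxxxxx continuation bytes. Simplified: it does not
    reject overlong forms or surrogates, and pure ASCII also passes."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b & 0x80 == 0x00:      # 0xxxxxxx: single byte
            need = 0
        elif b & 0xE0 == 0xC0:    # 110xxxxx: one continuation byte
            need = 1
        elif b & 0xF0 == 0xE0:    # 1110xxxx: two continuation bytes
            need = 2
        elif b & 0xF8 == 0xF0:    # 11110xxx: three continuation bytes
            need = 3
        else:
            return False          # stray continuation or invalid lead byte
        if i + need >= n:
            return False          # sequence truncated at end of input
        for k in range(1, need + 1):
            if data[i + k] & 0xC0 != 0x80:
                return False
        i += need + 1
    return True

# Illustrative Unicode ranges (assumed here, not taken from the paper).
RANGES = {
    "Korean": [(0xAC00, 0xD7A3), (0x1100, 0x11FF)],    # Hangul syllables, jamo
    "Japanese": [(0x3040, 0x309F), (0x30A0, 0x30FF)],  # hiragana, katakana
    "Chinese": [(0x4E00, 0x9FFF)],                     # CJK ideographs (also used in Japanese)
    "Cyrillic": [(0x0400, 0x04FF)],
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A)],
}

def script_percentages(text: str) -> dict[str, float]:
    """Percentage of characters falling in each range; others are ignored."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for name, spans in RANGES.items():
            if any(lo <= cp <= hi for lo, hi in spans):
                counts[name] += 1
                break
    total = sum(counts.values())
    return {name: 100.0 * c / total for name, c in counts.items()} if total else {}

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Byte N-gram counts, usable as features for a code page classifier."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

raw = "한국어 abc".encode("utf-8")
if is_utf8(raw):
    print(script_percentages(raw.decode("utf-8")))
else:
    print(byte_ngrams(raw).most_common(5))
```

In this sketch the same Korean text encoded as CP949 fails the bit-pattern check, so it would be routed to the N-gram branch instead of Unicode range analysis.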