US20240362263A1 - Method for managing language data and server using the same - Google Patents
Method for managing language data and server using the same

- Publication number: US20240362263A1
- Authority: United States (US)
- Legal status: Pending
Classifications
- G06F 16/31: Information retrieval of unstructured textual data; indexing; data structures therefor; storage structures
- G06F 16/316: Indexing structures
- G06F 16/322: Trees
- G06F 16/33: Querying
- G06F 16/35: Clustering; classification
- G06F 16/355: Creation or modification of classes or clusters
- G06F 16/901: Details of database functions independent of the retrieved data types; indexing; data structures therefor; storage structures
Abstract
Language data in a tree structure includes at least one node, and each node includes at least one word. A management server generates a plurality of word vectors, including a first word vector and a second word vector, based on the number of words included in each of a plurality of pieces of language data, and uses a dot product function of the first word vector and the second word vector to measure a score of similarity between first language data corresponding to the first word vector and second language data corresponding to the second word vector. The language data is thereby managed for determining similarity.
Description
- This application claims benefit under 35 U.S.C. 119, 120, 121, or 365 (c), and is a National Stage entry from International Application No. PCT/KR2022/012565 filed on Aug. 23, 2022, the entire contents of which are incorporated herein by reference.
- The present invention relates to a method for managing language data and a server using the same, and particularly relates to a method for managing language data for determining similarity, the method including, in a state in which the language data in a tree structure includes at least one node and the at least one node includes at least one word, generating, by a management server, a plurality of word vectors including a first word vector and a second word vector based on the number of words included in each of a plurality of pieces of language data, and using, by the management server, the dot product function of the first word vector and the second word vector to measure a score of similarity between first language data corresponding to the first word vector and second language data corresponding to the second word vector.
- As AI technologies develop, related technologies have been developed in a variety of industrial fields, such as self-driving cars, AI humanoids, and robots. In the field of language in particular, such technologies have been utilized to provide various techniques and services, such as recognition, translation, and AI chatbots.
- The field of language requires a wide range of subjects and diverse language data in order to improve the training abilities of artificial intelligence machines. Indiscriminate data, however, do not improve those training abilities, so a process of establishing relationships among the various pieces of language data is necessary.
- Against this background, the present inventor proposes a method for managing language data and a server using the same herein.
- Objectives of the present invention are as follows.
- An objective of the present invention is to support an AI training process by classifying a plurality of pieces of language data by category or level and storing the data.
- In addition, another objective of the present invention is to determine the level of similarity of each of a plurality of pieces of language data and to classify the results in advance so that the data can be utilized.
- According to an aspect of the present invention, a method for managing language data for determining similarity includes, in a state in which the language data in a tree structure includes at least one node and the at least one node includes at least one word, a step of generating, by a management server, a plurality of word vectors including a first word vector and a second word vector based on the number of words included in each of a plurality of pieces of language data, and a step of using, by the management server, a dot product function of the first word vector and the second word vector to measure a score of similarity between first language data corresponding to the first word vector and second language data corresponding to the second word vector.
- In addition, according to another aspect of the present invention, the management server may concatenate at least one word included in each of the plurality of pieces of language data, eliminate a stop word, and thus acquire a plurality of pieces of valid data, and generate the plurality of word vectors based on the number of keywords included in each piece of the plurality of valid data.
- In addition, according to another aspect of the present invention, when a score of similarity between the first language data and the second language data is greater than a reference value, the management server may group the first word vector corresponding to the first language data and the second word vector corresponding to the second language data to generate and express one cluster on a graph.
- In addition, according to another aspect of the present invention, in a state in which scores of similarity between the plurality of pieces of language data have been measured, the reference value may include a first reference value, a second reference value, and a third reference value, and the magnitudes of the first reference value, the second reference value, and the third reference value may sequentially increase, the management server may group word vectors of a pair of pieces of language data having a score of similarity higher than the first reference value together and then generate a plurality of first clusters on a graph, group word vectors of a pair of pieces of language data having a score of similarity higher than the second reference value together and then generate a plurality of second clusters on the graph, group word vectors of a pair of pieces of language data having a score of similarity higher than the third reference value together and then generate a plurality of third clusters on the graph, and acquire the second reference value satisfying a condition that the number of the plurality of second clusters is greater than the number of the plurality of first clusters or the number of the plurality of third clusters.
- In addition, according to another aspect of the present invention, when the language data includes a plurality of nodes constituted by one sentence and the plurality of nodes include a first node, a 1-1 node corresponding to a left child node of the first node, and a 1-2 node corresponding to a right child node of the first node, a 1-1 sentence included in the 1-1 node and a 1-2 sentence included in the 1-2 node may correspond to a response to a first sentence included in the first node.
- In addition, according to another aspect of the present invention, a level of the language data may be determined based on a word included in the first sentence, a word included in the 1-1 sentence, and a word included in the 1-2 sentence.
- In addition, according to another aspect of the present invention, each of the plurality of pieces of language data in the tree structure may have the same depth.
- In addition, according to another aspect of the present invention, a management server that manages language data for determining similarity includes a communication unit, a database, and a processor that, in a state in which the language data in a tree structure includes at least one node and the at least one node includes at least one word, generates a plurality of word vectors including a first word vector and a second word vector based on the number of words included in each of a plurality of pieces of language data, and uses a dot product function of the first word vector and the second word vector to measure a score of similarity between first language data corresponding to the first word vector and second language data corresponding to the second word vector.
- According to the present invention described above, the following effects can be exhibited.
- According to the present invention, the effect of supporting an AI training process by classifying a plurality of pieces of language data by category or level and storing the data can be exhibited.
- In addition, according to the present invention, the effect of determining the level of similarity of each of a plurality of pieces of language data and classifying the results in advance so that the data is utilized can be exhibited.
- FIG. 1 is a diagram illustrating an overall configuration of a management server according to an embodiment of the present invention.
- FIG. 2 is a diagram illustrating language data in a tree structure according to an embodiment of the present invention.
- FIG. 3 is a diagram illustrating a process of measuring a score of similarity between a plurality of pieces of language data according to an embodiment of the present invention.
- FIG. 4 presents graphs showing the number of clusters generated based on the magnitudes of reference values according to an embodiment of the present invention.
- FIG. 5 is a table showing pieces of information of language data according to an embodiment of the present invention.
- The detailed description of the present invention below refers to the appended drawings, which illustrate, by way of example, specific embodiments in which the present invention may be implemented. These embodiments are described in sufficient detail to enable a person skilled in the art to implement the present invention. It should be understood that the various embodiments of the present invention differ from each other but need not be mutually exclusive. For example, specific shapes, structures, and features described herein in relation to one embodiment can be implemented in another embodiment without departing from the spirit and scope of the present invention. In addition, it should be understood that the position or arrangement of individual components within each disclosed embodiment can be changed without departing from the spirit and scope of the present invention. Therefore, the detailed description below is not intended to be limiting, and the scope of the present invention is limited only by the appended claims, together with all equivalents of what the claims assert. Similar reference signs in the drawings denote the same or similar functions in various aspects.
- In order to enable a person with ordinary knowledge of the technical field to which the present invention belongs to easily implement the present invention, exemplary embodiments of the present invention will be described below in detail with reference to the appended drawings.
- FIG. 1 is a diagram illustrating an overall configuration of a management server according to an embodiment of the present invention.
- A management server (100) of the present invention includes a communication unit (110) and a processor (120) as illustrated in FIG. 1, and may not include a database (130) in some cases, unlike in FIG. 1.
- The communication unit (110) of the management server (100) can be realized by using any of various communication technologies. That is, Wi-Fi, wideband CDMA (WCDMA), High-Speed Downlink Packet Access (HSDPA), High-Speed Uplink Packet Access (HSUPA), High-Speed Packet Access (HSPA), mobile WiMAX, WiBro, Long Term Evolution (LTE), 5G, 6G, Bluetooth, Infrared Data Association (IrDA), Near Field Communication (NFC), Zigbee, wireless LAN technologies, and the like can be applied thereto. In addition, the communication unit may comply with TCP/IP, the standard protocol for transmitting information over the Internet, when a service is provided in connection with the Internet.
- Next, the database (130) of the present invention can store information on a plurality of pieces of language data (e.g., tree numbers, numbers of nodes, node depths, etc.). When an external database is used, the management server (100) is able to access the external database through the communication unit (110).
- In addition, the processor (120) can include hardware configurations such as a micro-processing unit (MPU) or a central processing unit (CPU), a cache memory, and a data bus. Furthermore, the management server (100) can further include software configurations such as an operating system and an application that accomplishes a specific goal.
- In addition, an administrator can transmit and receive information to and from the management server (100) through a terminal (not illustrated). A desktop computer, a notebook computer, a workstation, a PDA, a web pad, a mobile telephone, a smart remote controller, or any of various IoT devices may correspond to the terminal according to the present invention, as long as it can perform communication, is equipped with a memory section, and has a microprocessor mounted therein so as to be able to perform arithmetic operations.
- First, language data will be described with reference to FIG. 2.
- FIG. 2 is a diagram illustrating language data in a tree structure according to an embodiment of the present invention.
- Language data that the present invention deals with may have a tree structure and include at least one node. Here, each node may include one sentence.
- One piece of language data has a tree structure, and may include a plurality of nodes and sentences, such as a root node (“Do you like robots?”), child nodes (“Yes, I have a robot collection at home,” and “No, I do not like robots”), grandchild nodes (“What materials are your robots made of?,” and the like), lower nodes, and leaf nodes (“Yes, I have one of those,” “Yes, I have a Lego robot,” and the like), as illustrated in FIG. 2.
- Although the language data of the present invention has a binary tree structure as illustrated in FIG. 2, the language data (in the tree structure) may have a node with three or more child nodes, or only one child node, in some cases. In other words, the language data of the present invention may have any of various tree structures including one root node.
- In addition, the trees of the plurality of pieces of language data may all have the same depth (level). For example, the trees may have a depth (level) of 8 as saturated binary trees, and thus every piece of language data (every tree) may include 255 nodes and 255 sentences. Of course, the trees of the plurality of pieces of language data may have different depths and different numbers of nodes (sentences) in some cases.
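- The structure described above is easy to sketch in code. The following is a minimal illustration, not taken from the patent, of a binary dialogue tree whose nodes each hold one sentence; the class and function names are hypothetical. It also checks the arithmetic behind the depth-8 example, since a saturated (perfect) binary tree of depth 8 contains 2^8 - 1 = 255 nodes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DialogueNode:
    """One node of a piece of language data; each node holds one sentence."""
    sentence: str
    left: Optional["DialogueNode"] = None   # 1-1 node: one possible response
    right: Optional["DialogueNode"] = None  # 1-2 node: another possible response

def count_nodes(node: Optional[DialogueNode]) -> int:
    """Count the nodes (sentences) in one piece of language data."""
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)

# A saturated binary tree of depth d has 2**d - 1 nodes,
# so a depth-8 tree holds 2**8 - 1 = 255 nodes and 255 sentences.
assert 2 ** 8 - 1 == 255

root = DialogueNode(
    "Do you like robots?",
    left=DialogueNode("Yes, I have a robot collection at home."),
    right=DialogueNode("No, I do not like robots."),
)
print(count_nodes(root))  # 3
```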
- For convenience of description, it is assumed that a plurality of nodes included in any one piece of language data include a first node, a 1-1 node corresponding to the left child node of the first node, and a 1-2 node corresponding to the right child node of the first node.
- In this case, a 1-1 sentence included in the 1-1 node and a 1-2 sentence included in the 1-2 node may be responses to a first sentence included in the first node. As can be seen from FIG. 2, responses to “Do you have a Lego robot, too?” included in the first node may be “Yes, I have one of those” of the 1-1 node and “Yes, I have a Lego robot” of the 1-2 node.
- For reference, the sentences (nodes) included in the plurality of pieces of language data are acquired from a terminal (input) of an operator, and the tree structure may be determined by the operator as well based on the input from the terminal. The plurality of pieces of language data generated through the above-described process can be stored in the database (130).
- A process performed by the processor (120) of the management server (100) of the present invention will be discussed below with reference to
FIG. 3 . -
FIG. 3 is a diagram illustrating a process of measuring a score of similarity between a plurality of pieces of language data according to an embodiment of the present invention. - The processor (120) of the management server (100) can generate a plurality of word vectors based on the number of words included in each of the plurality of pieces of language data (S310). At this time, the plurality of word vectors may include a first word vector and a second word vector.
- Specifically, the processor (120) of the management server (100) concatenates at least one word included in each of the plurality of pieces of language data, eliminate stop words, and thereby can acquire a plurality of pieces of valid data. Since language data includes a plurality of nodes (sentences) as described above, the processor (120) can eliminate the stop words included in the sentences by concatenating the plurality of sentences into one.
- The processor (120) can eliminate not only the stop words but also punctuations and typos.
- One piece of valid data (data created by concatenating one word or sentence excluding stop words) can be acquired from a certain piece of language data, and the processor (120) can generate a plurality of word vectors based on the number of keywords included in each of a plurality of pieces of acquired valid data.
- Here, a variety of keywords such as ‘rain,’ ‘weather,’ and ‘music’ may be the examples, and they may be set in advance. In some cases, such keywords may not be set in advance and may be determined based on the words included in each piece of the valid data (e.g., a predetermined number of words may be set as keywords, the words being selected from those having been appeared in the plurality of pieces of valid data a larger number of times in descending order). Keywords may vary in each category to be described below.
- If the keywords are ‘rain,’ ‘weather,’ ‘robot,’ and ‘music,’ for example, the processor (120) may count how many times each of the keywords is included in the plurality of pieces of language data (valid data), and then generate multi-dimensional (e.g., four-dimensional) word vectors based on the results.
- Although only the keywords ‘rain,’ ‘weather,’ ‘robot,’ and ‘music’ are considered in the above description, practically more than 300 keywords can be considered, and thus the word vectors in that case may also be about 300-dimensional. In addition, the keywords may be set in advance, but may vary depending on words included in the plurality of pieces of language data.
- Word vectors according to an embodiment of the present invention are as follows.
- First, the total number of words included in specific language data (specific tree), the frequency of appearance of a keyword in the specific language data, the number of the plurality of pieces of language data (trees), and the number of pieces of language data (trees) including the keyword can be calculated.
- Next, a term frequency (TF: calculated with the formula “the frequency of appearance of a keyword in the specific language data/the total number of words included in the specific language data”), and an inverse document frequency (IDF, calculated with the formula “Log (the number of the plurality of pieces of language data (trees)/the number of pieces of language data (trees) including the keyword)”) may be calculated.
- In addition, the processor may acquire a calculated value for each keyword as the result of TF (the frequency of appearance of a keyword in the specific language data/the total number of words included in the specific language data)×IDF (Log (the number of the plurality of pieces of language data/the number of pieces of language data including the keyword)), and as a result, the processor can generate a plurality of multi-dimensional word vectors (0.52, 0.48, 0.28, . . . ).
- In other words, the plurality of word vectors can be generated by using so-called Terms Frequency-Inverse Document Frequency (TF-IDF) in the present invention.
- Next, the processor (120) of the management server (100) can use the dot product function between the first word vector and the second word vector to measure a score of similarity between first language data corresponding to the first word vector and second language data corresponding to the second word vector (S320).
- Here, the dot product function is a kind of cosine function from which a score of similarity between the multi-dimensional (e.g., 300-dimensional) first word vector and second word vector can be measured. The score of similarity may be the result of the dot product function greater than zero and smaller than 1.
- For example, when there is only the keyword ‘rain’ and the keyword ‘rain’ appears 10 times in the first language data, 20 times in the second language data, and one time in third language data, the score of similarity between the first word vector (corresponding to the first language data) and the second word vector (corresponding to the second language data) may be higher than the score of similarity between the first word vector and the third word vector (corresponding to the third language data).
- In other words, there will be a high probability that, as the two pieces of language data have a higher score of similarity, they are more highly correlated, include the same words, and share the same subject and belong to the same category.
- When the score (e.g., 0.8) of similarity between the first language data and the second language data is higher than a reference score (e.g., 0.6) according to an embodiment of the present disclosure, the processor (120) of the management server (100) groups the first word vector corresponding to the first language data and the second word vector corresponding to the second language data, and then creates one cluster on a graph (a graph on which multi-dimensional word vectors can be plotted) to represent the word vectors.
- Here, the word vectors mean the point of each type of language data marked on the multi-dimensional graph, and the cluster mean the group in which the word vectors are concatenated. In other words, each cluster means one subject (category), and there will be a high probability that the word vectors (language data) included in one cluster share the same subject.
- According to an embodiment of the present invention, it can be set that one of the word vectors is not allowed to constitute the cluster and two or more concatenated word vectors can constitute one cluster.
- Although only the first word vector, the second word vector, and the third word vector have been mentioned for convenience of description, scores of similarity between a plurality of word vectors can be measured.
- A process in which a score of similarity between a pair of word vectors (e.g., the first word vector and the second word vector) among a plurality of word vectors is measured and expressed will be described below with reference to
FIG. 4 . -
FIG. 4 are graphs showing the number of clusters created based on the magnitudes of reference values according to an embodiment of the present invention. For reference, the graphs are merely represented in two dimensions, but may actually correspond to graphs representing multi-dimensions (e.g., 300-dimensions). - It may be assumed that scores of similarity between a plurality of pieces of language data have been measured, the reference values include at least a first reference value, a second reference value, and a third reference value, and the magnitudes of the first reference value, the second reference value, and the third reference value sequentially increase. In other words, “first reference value<second reference value<third reference value” is assumed.
- At this time, the processor (120) of the management server (100) groups the word vectors of a pair of pieces of language data having a score of similarity higher than the first reference value together and then generates a plurality of first clusters on a graph.
- In addition, the processor (120) groups the word vectors of a pair of pieces of language data having a score of similarity higher than the second reference value together and then generates a plurality of second clusters on the graph, and groups the word vectors of a pair of pieces of language data having a score of similarity higher than the third reference value together and then generates a plurality of third clusters on the graph.
- In this state, the processor (120) can acquire the second reference value satisfying a condition that the number of the plurality of second clusters is greater than the number of the plurality of first clusters or the number of the plurality of third clusters.
- Referring to FIG. 4, the number of clusters is one when the reference value is 0.1, eight when the reference value is 0.2, 33 when it is 0.3, 53 when it is 0.4, 70 when it is 0.5, 75 when it is 0.6, 61 when it is 0.7, 27 when it is 0.8, and five when it is 0.9.
- Referring to FIG. 4, there is one large cluster when the reference value is 0.1, but when the reference value is 0.2, the word vectors are divided into eight smaller clusters. This means that the number of clusters increases as the reference value becomes higher. However, pairs of pieces of language data (word vectors) having a low score of similarity are no longer concatenated once the reference value exceeds 0.6, and thus the number of clusters may become smaller, which is reflected in FIG. 4.
- In more detail, it can be seen that the number of clusters is 70 when the reference value is 0.5, 75 when the reference value is 0.6, and 61 when the reference value is 0.7.
- Since the number of clusters is greater when the reference value is 0.6 than when the reference value is 0.5 or 0.7, the processor (120) may acquire 0.6 as the second reference value. In other words, the processor (120) may acquire the largest number of clusters (categories or subjects) when the reference value is 0.6 according to an embodiment of the present invention.
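- A sketch, under assumed data structures, of the reference-value sweep described above. The names `count_clusters` and `acquire_second_reference` and the pairwise `similarity` mapping are hypothetical; pairs whose score exceeds the reference value are concatenated, groups of two or more concatenated word vectors are counted as clusters (a lone word vector does not form a cluster, as noted earlier), and the reference value yielding the most clusters is acquired as the second reference value:

```python
from collections import Counter

def count_clusters(similarity: dict[tuple[int, int], float], n: int, reference: float) -> int:
    """Count groups of two or more word vectors concatenated by scores above `reference`."""
    parent = list(range(n))  # union-find over the n word vectors

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (a, b), score in similarity.items():
        if score > reference:
            parent[find(a)] = find(b)  # concatenate the pair into one group

    sizes = Counter(find(i) for i in range(n))
    return sum(1 for size in sizes.values() if size >= 2)  # singletons are not clusters

def acquire_second_reference(similarity: dict[tuple[int, int], float],
                             n: int, candidates: list[float]) -> float:
    """Return the candidate reference value that maximizes the number of clusters."""
    return max(candidates, key=lambda r: count_clusters(similarity, n, r))

# Hypothetical usage: with cluster counts behaving as in FIG. 4 (peaking at 75),
# sweeping candidates 0.1 ... 0.9 would acquire 0.6 as the second reference value.
# best = acquire_second_reference(similarity, n, [round(0.1 * k, 1) for k in range(1, 10)])
```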
- When pairs of pieces of language data having a score of similarity higher than the reference value are concatenated with each other to form clusters, and the reference value is low (e.g., 0.1), there are many pairs of language data whose scores exceed it; most pieces of the language data may therefore be concatenated with each other (due to the low reference value), and a small number of clusters (e.g., one) may be generated. The clusters at this time may include a large number of word vectors (language data).
- On the other hand, there may be a small number of pairs of language data having scores of similarity higher than a high reference value (e.g., 0.9), and thus fewer clusters (e.g., five) may be generated. The clusters at this time may include fewer word vectors (language data).
- As a result, the number of clusters may increase and then decrease as the reference value becomes higher. Considering FIG. 4, the largest number of clusters is 75, and the reference value at that time is 0.6. Depending on the types and the number of pieces of language data in various situations, the number of clusters for each reference value and the acquired reference value (0.6 in the case of FIG. 4) may differ from those of FIG. 4.
- The processor (120) of the present invention may determine whether pieces of language data are concatenated by adjusting the reference values described above, and may thereby determine the number, the shape, and the like of the clusters. Word vectors (language data) included in one cluster may be determined to be similar (to have the same category or subject).
- FIG. 5 is a table showing pieces of information of language data according to an embodiment of the present invention.
- As described above, language data of the present invention may have a tree structure in which a plurality of nodes (sentences) are included. For convenience of description, it can be assumed that the plurality of nodes includes a first node (a first sentence), a 1-1 node (a 1-1 sentence), a 1-2 node (a 1-2 sentence), and the like, as in FIG. 2 (Tree 3844).
- In this case, the processor (120) may determine a level of the language data based on the words included in the first sentence, the words included in the 1-1 sentence, and the words included in the 1-2 sentence.
- Specifically, a plurality of words may be grouped by level of difficulty, and the language data may have a higher level when it includes more words with a higher level of difficulty. Although the words are classified into level 1 to level 3 in FIG. 5, methods for determining levels may vary. The levels of difficulty of such words may be determined with reference to school curricula (e.g., words for higher graders, words for lower graders, etc.).
- In addition, categories (e.g., hobby, job, etc.) may be determined depending on the types of words included in the language data. Specifically, a plurality of categories may be set in advance, and each category may include a plurality of words.
- For example, the ‘hobby’ category may include words such as ‘robot,’ ‘bike,’ ‘car,’ and ‘picture,’ and the ‘job’ category may include words such as ‘doctor,’ ‘singer,’ and ‘lawyer.’ A given word may be included in a plurality of categories.
- The processor (120) of the management server (100) may analyze the words included in the language data, ascertain which category's words are included most frequently, and then determine the category of the corresponding language data based on the result.
- The same word (e.g., ‘robot’) may be included in the language data multiple (e.g., five) times, and the processor (120) may count each of those occurrences (e.g., all five) when determining the category. The reason for this is that there is no need to discount the frequency of a word that is used repeatedly.
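- A minimal sketch of the counting scheme just described, with hypothetical category and level word lists. The category whose words appear most often wins, with every occurrence counted; the level rule shown is only one plausible reading, since the description notes that methods for determining levels may vary:

```python
from collections import Counter

# Hypothetical, preset word lists; the same word may appear in several categories.
CATEGORY_WORDS = {
    "hobby": {"robot", "bike", "car", "picture"},
    "job": {"doctor", "singer", "lawyer"},
}
# Hypothetical difficulty grouping (FIG. 5 uses level 1 to level 3).
LEVEL_WORDS = {1: {"car", "bike"}, 2: {"robot", "picture"}, 3: {"doctor", "lawyer"}}

def determine_category(words: list[str]) -> str:
    """Pick the category whose words appear most often; every occurrence counts."""
    counts = Counter(words)  # 'robot' appearing five times contributes five
    totals = {cat: sum(counts[w] for w in vocab) for cat, vocab in CATEGORY_WORDS.items()}
    return max(totals, key=totals.get)

def determine_level(words: list[str]) -> int:
    """One plausible scoring rule (assumed): frequency-weighted average difficulty."""
    counts = Counter(words)
    per_level = {lvl: sum(counts[w] for w in vocab) for lvl, vocab in LEVEL_WORDS.items()}
    total = sum(per_level.values()) or 1
    return round(sum(lvl * c for lvl, c in per_level.items()) / total)

print(determine_category(["robot"] * 5 + ["doctor", "bike"]))  # -> 'hobby'
print(determine_level(["robot"] * 5 + ["doctor", "bike"]))     # -> 2
```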
- In addition, as shown in FIG. 5, the number unique to each of the plurality of pieces of language data (e.g., tree 3844 of FIG. 2), the number of nodes (e.g., 255), and the depth of the tree (e.g., eight) can be recorded in the database (130).
- As described above, the processor (120) of the management server (100) according to the present invention may group the plurality of pieces of language data in a tree structure based on whether the pieces are similar with reference to a reference value. In addition, the processor (120) may group the plurality of pieces of language data by level based on the level of each piece of language data stored in advance.
- In that state, the processor (120) may utilize the plurality of pieces of language data in English training programs according to an embodiment of the present invention.
- The processor (120) may allow a trainee to select one piece of first specific language data (first specific tree) among the plurality of pieces of language data for training.
- The processor (120) may randomly generate at least one of various stories, each composed of the sentences from the sentence corresponding to the root node of the first specific language data (first specific tree) down to a sentence corresponding to one of its leaf nodes (since there are a plurality of paths from the root node to the leaf nodes), and such a story can be considered an example of a first training program, as sketched below.
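- A sketch of this story generation under an assumed binary-node representation (the `Node` class and the toy tree are hypothetical): starting from the root sentence, a left or right child (each a response to its parent, as in the tree structure described earlier) is chosen at random until a leaf node is reached:

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """One node of the tree: a sentence plus up to two responding child sentences."""
    sentence: str
    left: Optional["Node"] = None   # e.g., the 1-1 node responding to this sentence
    right: Optional["Node"] = None  # e.g., the 1-2 node responding to this sentence

def random_story(root: Node) -> list[str]:
    """Walk one randomly chosen path from the root node down to a leaf node."""
    story, node = [], root
    while node is not None:
        story.append(node.sentence)
        children = [c for c in (node.left, node.right) if c is not None]
        node = random.choice(children) if children else None
    return story

# Hypothetical three-node tree: two different stories (paths) are possible.
tree = Node("Do you like rain?",
            left=Node("Yes, I love rainy days."),
            right=Node("No, I prefer sunny weather."))
print(random_story(tree))  # e.g., ['Do you like rain?', 'Yes, I love rainy days.']
```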
- After the first training program (from the root node to a leaf node) ends, the processor (120) may provide the trainee, as a second training program, with various stories (since there are a plurality of paths from the root node to the leaf nodes) generated from other language data having a high level of similarity to the first specific language data. At this time, the word vector corresponding to the first specific language data and the word vector corresponding to the other language data may be included in the same cluster.
- The processor (120) may acquire language data having a high level of similarity, or a certain level of similarity, to the first specific language data by adjusting the reference value, and may provide the acquired language data (training program) to the terminal of the trainee after the first specific language data, depending on the purpose of training.
- In addition, the processor (120) of the present invention may utilize the plurality of pieces of language data in a preprocessing process for training AI language models.
- Specifically, the processor (120) may allow an operator to select second specific language data in order to proceed with training of an AI language model. Since an enormous amount of refined data is needed to train an AI language model, the processor (120) of the management server (100) may provide the terminal of the operator with other language data having a high level of similarity to the second specific language data.
- At this time, the processor (120) may control the number of pieces of language data having a high level of similarity to the second specific language data by adjusting the reference value. The reason for this is that the total number of pieces of language data included in the cluster in which the second specific language data is included may vary depending on the reference value.
- Thus, when the operator selects the second specific language data including contents of a movie, the processor (120) may provide the terminal of the operator with other language data including the contents of the movie to support training.
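- A sketch, with assumed inputs, of this curation step: all language data in the same cluster as the operator's selected (second specific) language data is collected, and the reference value is chosen so that the cluster size approaches a desired amount of training data. The function names, the `similarity` mapping, and the index-based identifiers are hypothetical:

```python
def cluster_of(seed: int, similarity: dict[tuple[int, int], float],
               n: int, reference: float) -> set[int]:
    """Collect every word vector reachable from `seed` via scores above `reference`."""
    score = lambda a, b: similarity.get((a, b), similarity.get((b, a), 0.0))
    members, frontier = {seed}, [seed]
    while frontier:
        current = frontier.pop()
        for other in range(n):
            if other not in members and score(current, other) > reference:
                members.add(other)
                frontier.append(other)
    return members

def curate_training_data(seed: int, similarity: dict[tuple[int, int], float],
                         n: int, target: int, candidates: list[float]) -> set[int]:
    """Pick the reference value whose cluster around `seed` is closest in size to `target`."""
    best = min(candidates, key=lambda r: abs(len(cluster_of(seed, similarity, n, r)) - target))
    return cluster_of(seed, similarity, n, best)

# Hypothetical usage: indices of language data to provide to the operator's terminal.
# selected = curate_training_data(seed=42, similarity=scores, n=10000,
#                                 target=500, candidates=[0.5, 0.6, 0.7, 0.8])
```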
- The embodiments according to the present invention described above may be implemented in the form of program commands that can be executed by various computer components and recorded in a computer-readable recording medium. The computer-readable recording medium may include program commands, data files, data structures, and the like, alone or in combination. The program commands recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the art of computer software. Examples of computer-readable recording media include hardware devices specially configured to store and execute program commands, such as hard disks, ROMs, RAMs, flash memories, and the like. Examples of program commands include high-level language codes that can be executed by a computer using an interpreter or the like, as well as machine language codes generated by a compiler. The hardware devices may be configured to work as one or more software modules in order to execute processing according to the present invention, and vice versa.
- Although the present invention has been described above with reference to specific details such as specific constituent elements, limited embodiments, and drawings, these are provided only to help general understanding of the invention. The present invention is not limited to the embodiments, and those with ordinary knowledge in the technical field to which the present invention belongs may make various modifications and alterations based on these descriptions.
- Therefore, the idea of the present invention should not be limited to the above-described embodiments, and not only the scope of the claims described below but also all equivalents and equivalent modifications of that scope fall within the scope of the idea of the present invention.
Claims (8)
1. A method for managing language data for determining similarity, the method comprising:
in a state in which the language data in a tree structure includes at least one node and the at least one node includes at least one word,
(a) generating, by a management server, a plurality of word vectors including a first word vector and a second word vector based on the number of words included in each of a plurality of pieces of language data; and
(b) using, by the management server, a dot product function of the first word vector and the second word vector to measure a score of similarity between first language data corresponding to the first word vector and second language data corresponding to the second word vector.
2. The method for managing language data according to claim 1 , wherein the management server concatenates at least one word included in each of the plurality of pieces of language data, eliminates a stop word, and thus acquires a plurality of pieces of valid data, and generates the plurality of word vectors based on the number of keywords included in each piece of the plurality of valid data.
3. The method for managing language data according to claim 1 , wherein, when a score of similarity between the first language data and the second language data is greater than a reference value, the management server groups the first word vector corresponding to the first language data and the second word vector corresponding to the second language data to generate and express one cluster on a graph.
4. The method for managing language data according to claim 1 , wherein, in a state in which scores of similarity between the plurality of pieces of language data have been measured, the reference value includes a first reference value, a second reference value, and a third reference value, and the magnitudes of the first reference value, the second reference value, and the third reference value sequentially increase, the management server groups word vectors of a pair of pieces of language data having a score of similarity higher than the first reference value together and then generates a plurality of first clusters on a graph, groups word vectors of a pair of pieces of language data having a score of similarity higher than the second reference value together and then generates a plurality of second clusters on the graph, groups word vectors of a pair of pieces of language data having a score of similarity higher than the third reference value together and then generates a plurality of third clusters on the graph, and acquires the second reference value satisfying a condition that the number of the plurality of second clusters is greater than the number of the plurality of first clusters or the number of the plurality of third clusters.
5. The method for managing language data according to claim 1 , wherein, when the language data includes a plurality of nodes each constituted by one sentence and the plurality of nodes include a first node, a 1-1 node corresponding to a left child node of the first node, and a 1-2 node corresponding to a right child node of the first node, a 1-1 sentence included in the 1-1 node and a 1-2 sentence included in the 1-2 node each correspond to a response to a first sentence included in the first node.
6. The method for managing language data according to claim 5 , wherein a level of the language data is determined based on a word included in the first sentence, a word included in the 1-1 sentence, and a word included in the 1-2 sentence.
7. The method for managing language data according to claim 1 , wherein each of the plurality of pieces of language data in the tree structure has the same depth.
8. A management server configured to manage language data for determining similarity, the management server comprising:
a communication unit;
a database; and
a processor configured to, in a state in which the language data in a tree structure includes at least one node and the at least one node includes at least one word, generate a plurality of word vectors including a first word vector and a second word vector based on the number of words included in each of a plurality of pieces of language data, and use a dot product function of the first word vector and the second word vector to measure a score of similarity between first language data corresponding to the first word vector and second language data corresponding to the second word vector.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/KR2022/012565 WO2024043355A1 (en) | 2022-08-23 | 2022-08-23 | Language data management method and server using same |
Publications (1)
Publication Number | Publication Date |
---|---|
US20240362263A1 true US20240362263A1 (en) | 2024-10-31 |
Family
ID=90013561
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/039,384 Pending US20240362263A1 (en) | 2022-08-23 | 2022-08-23 | Method for managing language data and server using the same |
Country Status (2)
Country | Link |
---|---|
US (1) | US20240362263A1 (en) |
WO (1) | WO2024043355A1 (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060224584A1 (en) * | 2005-03-31 | 2006-10-05 | Content Analyst Company, Llc | Automatic linear text segmentation |
US20180253496A1 (en) * | 2017-02-28 | 2018-09-06 | Laserlike Inc. | Interest embedding vectors |
US20190079753A1 (en) * | 2017-09-08 | 2019-03-14 | Devfactory Fz-Llc | Automating Generation of Library Suggestion Engine Models |
US20190266288A1 (en) * | 2018-02-28 | 2019-08-29 | Laserlike, Inc. | Query topic map |
US20200020325A1 (en) * | 2018-07-12 | 2020-01-16 | Aka Intelligence Inc. | Method for generating chatbot utterance based on semantic graph database |
US20210357378A1 (en) * | 2020-05-12 | 2021-11-18 | Hubspot, Inc. | Multi-service business platform system having entity resolution systems and methods |
US11294974B1 (en) * | 2018-10-04 | 2022-04-05 | Apple Inc. | Golden embeddings |
US20220171945A1 (en) * | 2020-12-01 | 2022-06-02 | Rammer Technologies, Inc. | Determining conversational structure from speech |
US20230061341A1 (en) * | 2021-08-29 | 2023-03-02 | Technion Research & Development Foundation Limited | Database record lineage and vector search |
US20230103834A1 (en) * | 2021-09-30 | 2023-04-06 | Nasdaq, Inc. | Systems and methods of natural language processing |
US20230229680A1 (en) * | 2022-01-14 | 2023-07-20 | Sap Se | Auto-generation of support trees |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6918097B2 (en) * | 2001-10-09 | 2005-07-12 | Xerox Corporation | Method and apparatus for displaying literary and linguistic information about words |
KR101130457B1 (en) * | 2004-11-04 | 2012-03-28 | 마이크로소프트 코포레이션 | Extracting treelet translation pairs |
WO2015145981A1 (en) * | 2014-03-28 | 2015-10-01 | 日本電気株式会社 | Multilingual document-similarity-degree learning device, multilingual document-similarity-degree determination device, multilingual document-similarity-degree learning method, multilingual document-similarity-degree determination method, and storage medium |
KR102375511B1 (en) * | 2020-07-24 | 2022-03-17 | 주식회사 한글과컴퓨터 | Document storage management server for performing storage processing of document files received from a client terminal in conjunction with a plurality of document storage and operating method thereof |
CN112818686B (en) * | 2021-03-23 | 2023-10-31 | 北京百度网讯科技有限公司 | Domain phrase mining methods, devices and electronic devices |
2022
- 2022-08-23 US US18/039,384 patent/US20240362263A1/en active Pending
- 2022-08-23 WO PCT/KR2022/012565 patent/WO2024043355A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
WO2024043355A1 (en) | 2024-02-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: AKA AI CO., LTD., KOREA, REPUBLIC OF Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JUNG, MYUNG WON;KWON, EUN HYE;NAM, HOO RAM;AND OTHERS;SIGNING DATES FROM 20230518 TO 20230523;REEL/FRAME:063794/0265 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |