US9754176B2 - Method and system for data extraction from images of semi-structured documents - Google Patents
Info
- Publication number
- US9754176B2 (application US14/868,683, US201514868683A)
- Authority
- US
- United States
- Prior art keywords
- document
- hypotheses
- text
- image
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V30/40—Document-oriented image-based pattern recognition
- G06K9/18
- G06V30/412—Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
- G06K9/4604
- G06V30/224—Character recognition characterised by the type of writing of printed characters having additional code marks or containing code marks
- G06V30/413—Classification of content, e.g. text, photographs or tables
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
- G06V30/10—Character recognition
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
Definitions
- The present invention relates to the field of data extraction from images of documents by means of Optical Character Recognition (OCR). More specifically, the present invention relates to utilizing a graph-based approach, a cascade classification approach, and a reduced alphabet approach to reduce field identification errors when processing textual information in images of documents.
- The graph makes it possible to save the results of all search procedures and to examine different combinations of the fields during further analysis of the search results.
- The method organizes the work of the field search procedures as a cascade classification, which saves computational resources and calculates only the required number of features.
- The method uses a reduced alphabet technique for generating dictionaries of keywords, which decreases the number of field identification mistakes made by the search procedures that employ the keyword dictionaries.
- the present invention is directed to a method of extracting data from fields in an image of a document.
- a text representation of the image of the document is obtained.
- a graph for storing features of the text fragments in the text representation of the image of the document and their links is constructed.
- A cascade classification for computing the features of the text fragments in the text representation of the image of the document and their links is run.
- Hypotheses about which fields of the document the text fragments belong to are generated.
- Combinations of the hypotheses are generated.
- a combination of the hypotheses is selected.
- data from the fields in the image of the document is extracted based on the selected combination of the hypotheses.
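- A minimal sketch of this overall flow is given below. It only illustrates the order of the claimed steps: every identifier (ocr, build_graph, cascade_classify, generate_hypotheses, combine, quality) and the threshold/round limit are hypothetical stand-ins supplied by the caller, not part of the patented implementation.

```python
# Hypothetical orchestration of the claimed steps; all callables are assumed
# to be supplied by the caller and are not taken from the patent.
def extract_fields(document_image, ocr, build_graph, cascade_classify,
                   generate_hypotheses, combine, quality,
                   threshold=80.0, max_rounds=5):
    text = ocr(document_image)                      # obtain a text representation (OCR)
    graph = build_graph(text)                       # nodes = text fragments, edges = links
    best = None
    for _ in range(max_rounds):
        cascade_classify(graph)                     # compute features of fragments and links
        hypotheses = generate_hypotheses(graph)     # fragment-to-field hypotheses
        combinations = combine(hypotheses)          # combinations of the hypotheses
        best = max(combinations, key=quality)       # best-scoring combination so far
        if quality(best) >= threshold:              # sufficient quality: stop iterating
            break
    # extract data from the fields according to the selected combination
    return {h["field"]: h["fragment"] for h in best}
```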
- FIG. 1 illustrates the method of automated extraction of meaningful information from a semi-structured document.
- FIG. 2 illustrates the key steps of the entity extraction.
- FIG. 3 illustrates the method of identifying the fields in a semi-structured document utilizing a graph and a cascade classification.
- FIG. 4 illustrates the method of the cascade classification.
- FIG. 5 illustrates an example of a graph constructed for a business card.
- FIG. 6 provides a general architectural diagram for various types of computers and other processor-controlled devices.
- FIG. 7 illustrates an example of a word represented in a dictionary using the reduced alphabet technique.
- FIG. 8 illustrates an example of a word in the reduced alphabet that corresponds to two source words.
- A field is information in a document that needs to be identified and extracted from the document.
- Shown in FIG. 1 is a general illustration of the method of automated extraction of meaningful information from an image of a semi-structured document.
- the semi-structured document is a document containing a set of information fields (a document element intended for data extraction) whose design, number and layout may vary significantly in different versions of the document.
- An example of such a semi-structured document is a sales receipt or a business card, although the present method and the corresponding system are not limited to sales receipts or business cards.
- An image of a document is provided as input to the system.
- the image is pre-processed at a stage of preliminary image processing which serves to reduce noise and various image imperfections, as well as to adjust the image quality to make it suitable for further processing.
- Further processing at step 14 involves document analysis to determine the physical structure of the analyzed document, such as, for example, text blocks, the presence or absence of a table, and so on.
- At step 16, optical character recognition of the document is performed (conversion of images of typewritten or printed text into machine-encoded text).
- Step 18 represents an entity (field) extraction step of the document processing method of the present invention.
- FIG. 2 illustrates the key steps of the entity (fields) extraction step 18 in FIG. 1 .
- Step 20 in FIG. 2 illustrates constructing a graph as a structure for storing numerical characteristics of the fragments of the text and their links.
- Step 22 in FIG. 2 shows identification of the fields in the image of the document by utilizing the method of cascade classification. Fields in a document may be simple (without an internal structure, e.g. the value of the goods) or composite (with an internal structure, e.g. an address field). Therefore, in step 24 of FIG. 2, the components of fields are identified by using, for example, regular expressions, keywords and other information of interest. Extraction of the desired identified information (data) is illustrated in step 26 of FIG. 2.
- FIG. 5 illustrates a graph constructed for a business card.
- the text representation of the business card ( 500 ) is represented in the form of a graph.
- the nodes ( 502 ) of such graph comprise text fragments of the document being analyzed.
- the nodes of the graph also comprise numerical characteristics of the text fragments.
- the nodes of the graph are compared with the fields in the document during document analysis.
- the nodes of the graph are connected by edges ( 504 ), which store numerical characteristics of logical connections (links) between the text fragments.
- Each node of the graph is matched with one word in the text, and the edges in the graph set a linear order of the words in the text.
- The linear order is the supposed reading order of the text in a document, which depends on the language of the document (for example, for English documents the reading order is from left to right; for Hebrew, from right to left).
- Each word has a corresponding node.
- two or more nodes may be merged together or one node can be split into two or more new nodes.
- Edges between the nodes may also be removed or added, and the numerical characteristics of the nodes (text fragments) and edges (links) may be changed.
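- A minimal sketch of such a graph is shown below (an assumed representation, not the patent's own data structures): nodes hold text fragments and their numerical features, edges hold link features, and the split operation separates one fragment into two new nodes.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str                                       # the text fragment (word or word combination)
    features: dict = field(default_factory=dict)    # numerical characteristics, e.g. {"size": "large"}

@dataclass
class Edge:
    src: Node
    dst: Node
    features: dict = field(default_factory=dict)    # characteristics of the logical link

class FragmentGraph:
    def __init__(self, words):
        # one node per word; edges encode the supposed reading order
        self.nodes = [Node(w) for w in words]
        self.edges = [Edge(a, b, {"linear_order": True})
                      for a, b in zip(self.nodes, self.nodes[1:])]

    def split(self, node, index):
        """Split one node into two, e.g. to separate letters from digits."""
        left, right = Node(node.text[:index]), Node(node.text[index:])
        i = self.nodes.index(node)
        self.nodes[i:i + 1] = [left, right]
        for e in self.edges:                        # re-attach the old node's edges
            if e.dst is node:
                e.dst = left                        # incoming link now ends at the left part
            if e.src is node:
                e.src = right                       # outgoing link now starts at the right part
        self.edges.append(Edge(left, right, {"linear_order": True}))
        return left, right
```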
- FIG. 3 illustrates the method of extracting data from a semi-structured document utilizing a graph and a cascade classification.
- a system for extracting the data from a semi-structured document receives a text representation of a document image ( 302 ).
- the text representation is a result of an optical character recognition (OCR) of the document image.
- OCR optical character recognition
- a graph is constructed to represent the recognized text of the document.
- Each node in the graph is matched with a word, a word combination, or a fragment of the text representation (such as, for example, the fragment "www" in an "http://" address, or the ending of a last name), meaning that the number of nodes in the graph is equal to or greater than the number of words in the text representation of the image of the document, and that each node is connected to all other nodes by edges.
- a cascade classification of identified text fragments of the document is performed.
- the cascade classification is a method for collecting information about features of text fragments, including computation of the text fragments' features, and about links between the text fragments.
- the cascade classification is an iterative process. The process of the cascade classification runs until the collected information about the text fragments and the links between them is adequate to generate hypotheses about the text fragments belonging to a particular document field. Each iteration in the cascade classification is a running of a particular procedure.
- A procedure is a computer program function that processes a graph, calculates certain features for nodes and edges of the graph, and generates new features. Feature computation and feature generation are performed based on the previous data (calculated by the previous procedures). When a particular procedure is launched, the cascade principle is applied: the nodes corresponding to the text fragments that do not need to be processed by this particular procedure are disregarded.
- For example, if the first procedure in the cascade classification is a font size determination, in which text fragments (nodes) within a business card are divided into text fragments of type "small" and text fragments of type "large", and the following procedure is, for example, a search for names in the document, then the nodes having the value "small" for the feature "size" are cut off, and the second procedure calculates values only for the nodes whose font size feature was determined to be "large".
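- Using the Node/FragmentGraph sketch above, the cascade cut-off in this font-size example might look roughly as follows; the feature names, the 12-point threshold and the dictionary lookup are illustrative assumptions only.

```python
def font_size_procedure(graph):
    # cheap first procedure: tag every fragment as "small" or "large"
    for node in graph.nodes:
        node.features["size"] = "large" if node.features.get("height", 0) > 12 else "small"

def name_search_procedure(graph, name_dictionary):
    # more expensive second procedure: runs only on nodes that passed the cut-off
    for node in graph.nodes:
        if node.features.get("size") != "large":
            continue                                # cascade principle: "small" nodes are disregarded
        node.features["is_name"] = node.text.lower() in name_dictionary
```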
- Keywords are words which are associated with certain fields and which may be used to detect fields.
- FIG. 4 illustrates a block diagram of the cascade classification of the text fragments.
- the system receives a graph ( 402 ).
- the text fragments are nodes of the graph.
- the nodes of the graph are joined by edges (logical links between the text fragments) which set a linear order of recognized text within the document. There can also be other edges, alternative to linear order links.
- At step 404, a procedure for field searching is selected from among the plurality of available procedures based on a resource-consumption principle: the most resource-saving procedures are selected first, and the most resource-intensive procedures are selected later.
- Examples of resource-saving procedures are searching for text fragments consisting of letters, searching for text fragments that include numbers, or identifying text fragments by font size.
- An example of the resource-intensive procedure is searching for fields using a large electronic dictionary.
- the procedure selected at step 404 is launched.
- The procedure analyzes all text fragments (or some portion of the fragments) and generates a value for a feature or values for multiple features (numerical characteristics). More details about the procedures and the features they identify are provided below.
- The graph is then modified based on the values of the features identified by the procedure for all or some of the text fragments. Namely, the feature values in the nodes (text fragments) and in the edges of the graph are updated. At this step any changes to the graph structure may also be performed. For example, it may become necessary to split a text fragment containing both letters and numbers into two nodes in order to separate the letters from the numbers. This is done by forming a new node, adding an edge from the old node to the new node, and moving the edges that originated from the old node to the new node.
- At step 410, it is determined whether the received information (computed features) about the text fragments and their links is sufficient for generating one or more hypotheses about the types of fields associated with these text fragments.
- The next procedure is selected based on the previously computed features.
- each subsequent procedure is more resource-intensive than the previous one.
- An example of a more resource-intensive procedure may be a procedure of searching for a text fragment satisfying a certain regular expression.
- a regular expression is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings.
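- Steps 404-410 can be pictured as a loop over procedures ordered from the cheapest to the most expensive, stopping as soon as the collected features are sufficient; the cost estimates and the sufficiency test in this sketch are assumptions, since the patent only states the ordering principle.

```python
def run_cascade(graph, procedures, sufficient):
    # procedures: list of (estimated_cost, callable); the cheapest run first (step 404)
    for cost, procedure in sorted(procedures, key=lambda p: p[0]):
        procedure(graph)        # step 406: compute feature values; step 408: the graph may be modified
        if sufficient(graph):   # step 410: enough information to generate hypotheses?
            break
```

- For instance, procedures might pair the font_size_procedure from the earlier sketch with a low cost estimate and a dictionary-based name search with a much higher one.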
- the hypotheses about the text fragments belonging to certain fields of the document and their combinations are generated.
- the hypothesis generation is based on the information about the text fragments and their links calculated during the cascade classification 306 .
- As hypotheses are generated, they are assigned certain confidence levels.
- the confidence level of a hypothesis depends on the values of features of the text fragments and their links. In one embodiment the confidence of the hypotheses is measured as a percentage from 0 to 100%. Multiple hypotheses with different or equal confidence levels may be generated for a single text fragment. As a result, several combinations of the hypotheses about the fields of the document may be generated.
- The quality of the different combinations of hypotheses about the fields of the document is then computed.
- The computed quality of the combinations of hypotheses may be used to compare the combinations of hypotheses with each other.
- A number of metrics measuring the quality of a combination of hypotheses are taken into consideration.
- the first metric is a cumulative metric of the confidence levels for a combination of the hypotheses.
- The cumulative metric of the confidence levels for all the hypotheses within the combination of hypotheses is computed as a sum of the confidence levels computed for each text fragment with respect to its hypothesis. The higher the cumulative metric of the confidence levels of all the hypotheses within the combination, the better the quality of the combination.
- The second metric is a cumulative metric of compatibility fines (penalties) for the different hypotheses within one combination.
- the cumulative metric of the fines for a combination of hypotheses is computed as a sum of the fines.
- The combinations of hypotheses are fined based on certain rules describing specific fields, their characteristics, their arrangement with respect to each other (the geometry of the fields), and the possible structure of a document.
- the rules are usually created for a particular type of document. For example, in a business card the name field cannot be absent, there cannot be two name fields or several address fields located in different parts of the document, etc.
- The smaller the cumulative metric of compatibility fines for a combination of hypotheses, the better the quality of that combination.
- the computed quality of a combination of hypotheses is at least in part based on the cumulative metric of confidences and the cumulative metric of fines.
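- A sketch of combining the two metrics into a single quality score is shown below; the subtraction and the fine weight are assumptions, since the patent only states that the quality is based at least in part on the cumulative confidence metric and the cumulative metric of fines.

```python
def combination_quality(hypotheses, rules, fine_weight=1.0):
    # hypotheses: list of (fragment, field, confidence) tuples, confidence in percent
    cumulative_confidence = sum(confidence for _, _, confidence in hypotheses)
    # each rule inspects the whole combination and returns a fine (penalty) >= 0,
    # e.g. a fine when a business card combination lacks a Name field
    cumulative_fines = sum(rule(hypotheses) for rule in rules)
    return cumulative_confidence - fine_weight * cumulative_fines
```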
- At step 312, it is determined whether the computed quality of a combination of hypotheses is above a predefined threshold value, or whether there is a combination of hypotheses of sufficient quality for the entire document. If such a combination is not found, the method returns to step 306 and the process of obtaining additional characteristics of the text fragments and their links utilizing the cascade classification resumes. In this case, in one embodiment, further, more complicated (resource-intensive) procedures are used, for example procedures that utilize larger glossaries. If a suitable combination of hypotheses is found, the analysis of the text of the document is considered complete, i.e., the fields in the image of the document are identified. In another embodiment, at step 312 the combination of hypotheses with the highest computed quality is selected.
- For each combination, its quality is computed, i.e. each combination is attributed a numerical quality indicator that characterizes the quality of the whole combination of hypotheses.
- the best combination of hypotheses is chosen by comparing these quality indicators of combinations with each other and with a predefined threshold. Such comparison may not be very accurate.
- the feature vector of the combination of the hypotheses may include a list of fields and their links identified by the combination of the hypotheses.
- The feature vector of the first combination of hypotheses may include the following: the document contains two logically linked fields, wherein the first field is Name and the second field is Surname.
- The feature vector of the second combination of hypotheses may include the following: the document contains two logically linked fields, wherein the first field is Name and the second field is Position (Job).
- a set of rules for business cards may include a rule according to which the best combination of hypotheses for business cards is a combination, which has both Name and Surname fields, and these fields are located next to each other (logically linked).
- The first combination of hypotheses wins, because its feature vector is more consistent with the rules. This alternative approach to comparing combinations of hypotheses is more accurate and allows more nuances to be taken into account.
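- The business-card rule just described might be applied to the feature vectors of the two combinations as in the sketch below; the dictionary layout and the reward value are illustrative assumptions.

```python
def business_card_rule(feature_vector):
    """Reward combinations containing logically linked Name and Surname fields."""
    field_types = {f["type"] for f in feature_vector["fields"]}
    linked = {"Name", "Surname"} <= field_types and feature_vector["linked"]
    return 10 if linked else 0

combo_1 = {"fields": [{"type": "Name"}, {"type": "Surname"}], "linked": True}
combo_2 = {"fields": [{"type": "Name"}, {"type": "Position"}], "linked": True}
best = max([combo_1, combo_2], key=business_card_rule)   # combo_1 wins
```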
- Running procedures for ascertaining new or additional features should be performed in the order necessary for providing high quality classification.
- the order of running classifiers is determined either manually or automatically.
- the need to recognize a plurality of sales receipts from several different stores is considered.
- the fields in sales receipts from the same store usually have the same geometrical structure—the locations of the fields, the font and other features do not vary on such sales receipts from the same vendor.
- Such features usually vary in their location and fonts on sales receipts from different stores or vendors.
- In order to classify sales receipts from different stores, it is first necessary to run the vendor search procedure, utilizing a dictionary of vendor names, i.e., to add vendor identification to the cascade classifier; if a vendor is identified, the rest of the fields are identified utilizing the corresponding template associated with that vendor. If, for example, the vendor's cash registers have not changed since the time the template was created, the use of the current template will significantly speed up the process of extracting data.
- The method of the present invention utilizes different heuristics for more precisely selecting several possible locations of a certain type of field in the receipt. For example, the field corresponding to the name of the vendor is often located next to the field corresponding to the address. Therefore, if the location corresponding to the vendor field is known, it is likely that the address text is located near the name of the vendor, so the corresponding feature can be identified and introduced. Conversely, if the vendor name field is found, the address field can be found by utilizing the feature «Text fragment is located near Vendor Name».
- One of the examples of using the heuristics in the method of the present invention applies to the case of variations of the features of a text fragment presented as a string of numbers (for example, “ . . . 428”).
- Examples of the corresponding features are the following: a feature of a phone number, or a feature of the price of a purchased item or items, or a feature of the address (a number of the building, a portion of the zip code and the like).
- the features can be in the form of a regular expression.
- An example of such a feature is «Text fragment satisfies the regular expression for date format» (for instance, MM/DD/YYYY).
- The features can also be in the form of frequent keywords in documents such as sales receipts: the combination "tel" corresponding to a phone number, the combination "total" corresponding to the total purchase price, and so on.
- Keywords such as "thank you for coming to" may be used to find the vendor's name field (the corresponding feature is «Text fragment is located after the phrase "thank you for coming to"»).
- The keyword «address» may be used to find the address field in a sales receipt or on a business card (the corresponding feature is «Text fragment is located after an address label» (such as "address")).
- The keyword «company» may be used to find the company name field on a business card (the corresponding feature is «Text fragment is located after a company label» (such as "company")).
- the method and system of the present invention also contemplate using binary features in identifying desired fields in the image of the document. Extracting various entities from a document, such as a sales receipt, occurs automatically by training the system on recognizing binary features (i.e. by classification).
- Binary features for the entity corresponding to the vendor name field can be: the proximity of an address field, the presence of quotation marks, or the nearby presence of words such as "Inc." or "LLC".
- the keywords may be a part of a field.
- Examples of such features are «Text fragment has a street keyword in it» (such as "St.", "avenue", "drive"), «Text fragment has occupation words» (such as "agent", "broker", "programmer"), or «Text fragment has a city name in it».
- Other examples are «Text fragment is located after a "url" label» (such as "web:", "url:"), «Text fragment comprises the symbols "www"» (perhaps not exactly "www", but something alike, such as "11w"), and «Text fragment includes a domain name» (such as "com", "net").
- Procedures can also compute features of the links (edges) between the text fragments.
- Examples of such features are: «The edge is between two compatible text fragments» (such as several columns), «The edge is between similar horizontal lines», «The edge is between words», «The edge is over a punctuation mark», «The edge is derived by a finder».
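- Several of the fragment-level features listed above can be pictured as simple predicate (binary) functions over a text fragment and its left neighbour; the regular expressions and keyword lists in this sketch are illustrative, not the patent's own.

```python
import re

DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")       # MM/DD/YYYY
STREET_WORDS = ("st.", "avenue", "drive")
URL_LABELS = ("web:", "url:")

def satisfies_date_format(fragment):
    return bool(DATE_RE.match(fragment))

def has_street_keyword(fragment):
    return any(word in fragment.lower() for word in STREET_WORDS)

def is_located_after_url_label(fragment, previous_fragment):
    return previous_fragment.lower() in URL_LABELS

def comprises_www_symbols(fragment):
    # tolerate OCR confusions such as "11w" instead of "www"
    return bool(re.fullmatch(r"[wv1l]{3}", fragment.lower()))
```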
- the method of cascade classification invokes only some of its procedures which are necessary for the correct identification of only those desired fields.
- the identification process can be completed.
- The identification can be completed if no conflict occurred during the classification, i.e. there were no occurrences of matching one text fragment with several conflicting classes (types), such as "corresponds to a phone number" and "does not correspond to a phone number".
- the inventive method can use an electronic dictionary (a dictionary), but in certain implementations of the method a dictionary is not needed.
- Many search procedures use specialized dictionaries. For example, to identify the "name" field, a search procedure uses a specialized dictionary of names; to identify the "address" field, a search procedure uses specialized dictionaries of streets and cities; to identify the "occupation" field, a search procedure uses a specialized dictionary of professions, positions, and occupations; and to identify fields that are usually accompanied by keywords, such as Tel, Fax, ph, T, F, Total, etc., a search procedure that uses a specialized dictionary of keywords is involved. The words found in a specialized dictionary can be different for different languages and/or countries. Dictionary words are used as features of the fields.
- The above-described problem of dictionary use (exact dictionary lookup can fail because keywords are recognized with OCR errors) is solved in the present invention by using a reduced alphabet technique.
- the essence of the reduced alphabet technique is the following.
- an alphabet is a generic term for a set of characters that may include numbers, letters, punctuation marks, and/or mathematical and special symbols.
- The set of meta-representative characters of the groups forms a new alphabet B, where the set of characters in alphabet B is a subset of the characters of alphabet A.
- Having obtained alphabet B, the set of characters of alphabet A is mapped into the set of characters of alphabet B (f: A → B). A word is then searched for in a specialized dictionary using alphabet B.
- Combining symbols into groups can be implemented based on OCR error statistics. Namely, if there is a reference text and a recognized version of the same text, the characters that were recognized incorrectly, i.e. mixed up with high frequency during the recognition process, can be identified. For example, on the basis of these data, look-alike characters from the English and Russian alphabets, such as "YyУу", "AaАа", "HhНн", "OoОо0", etc., may be grouped as easily confused for one another. Such characters that can be easily confused for one another are grouped together.
- For each group, a meta-representative character is either selected or created. For example, the group of characters "YyУу" can be represented by the single character "Y".
- Each character from the initial alphabet covered by UTF32 encoding is represented by 4 bytes.
- a character from the reduced alphabet may be represented by a smaller number of bytes.
- the meta representative characters of different groups form a reduced alphabet.
- the reduced alphabet consists of the characters/symbols/letters which either cannot be confused for one another, or which are hard to confuse with any other character.
- the results of the recognition process of a semi-structured document and the words in a specialized dictionary can be represented using this reduced alphabet.
- By replacing letters in the specialized dictionary with corresponding meta representative characters we convert the specialized dictionary into a reduced alphabet dictionary.
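- A sketch of this conversion is given below. The confusion groups shown are examples only; in the patent they would come from OCR error statistics, and here the first character of each group plays the role of the meta-representative character.

```python
# example confusion groups (illustrative); the first character of each group
# serves as the meta-representative character
CONFUSION_GROUPS = ["Il1i|", "O0oQ", "Tt", "Ee", "Aa"]

REDUCE = {}
for group in CONFUSION_GROUPS:
    meta = group[0]
    for character in group:
        REDUCE[character] = meta

def to_reduced(word):
    """Map a word from the source alphabet A into the reduced alphabet B (f: A -> B)."""
    return "".join(REDUCE.get(character, character) for character in word)

# converting a specialized keyword dictionary into a reduced alphabet dictionary
keywords = ["Tel", "Total", "Fax"]
reduced_dictionary = {to_reduced(word): word for word in keywords}

# a recognition result with confused characters still hits the right entry
assert to_reduced("Te1") == to_reduced("Tel")
assert reduced_dictionary[to_reduced("TeI")] == "Tel"
```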
- FIG. 7 illustrates an example of a word represented in a specialized dictionary using the technique of the reduced alphabet.
- A business card 700 contains a phone number field, and this field is accompanied by the «Tel» keyword 701.
- «Tel» can be recognized with different errors.
- FIG. 7 lists some of the variants of recognition of the word “Tel” with errors 702 .
- recognition character groups can be identified ( 704 , 706 and 708 ). In these groups, meta representative characters are identified. In FIG. 7 the identified meta representative characters ( 705 , 707 , 709 ) are marked bold.
- the reduced alphabet dictionary provides not only correct identification of the document fields, in spite of possible OCR errors in word/keywords, but can also help to correct these errors.
- the reduced alphabet dictionary has a special structure.
- a structural unit of the reduced alphabet dictionary is a word in the reduced alphabet.
- The words in the reduced alphabet dictionary are stored with their related source word(s) and with the set(s) of recognition variants (containing OCR errors) associated with these source words. If a word in the reduced alphabet is associated with only one source word and, accordingly, with only one set of recognition variants of that source word, then an incorrect spelling of the source word may be corrected by using the reduced alphabet dictionary.
- FIG. 8 illustrates a more complicated case where one word in the reduced alphabet corresponds to two source words.
- the two source words are Sheila ( 800 ) and Shelia ( 802 ).
- The words Sheila (800) and Shelia (802) may both be represented by the word SHEIIA (804), because the sets of recognition variants of these two source words intersect (806 and 808).
- the incorrect spelling of the source word may be corrected by using the reduced alphabet dictionary if the erroneous versions of recognition do not fall in the intersected subset ( 810 ).
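- A sketch of this dictionary structure and of the correction rule is shown below (it reuses the hypothetical to_reduced() mapping from the previous sketch); a recognized word is corrected only when its reduced form maps to exactly one source word, so the Sheila/Shelia pair is left untouched.

```python
from collections import defaultdict

def build_reduced_dictionary(source_words, to_reduced):
    reduced_dictionary = defaultdict(set)
    for word in source_words:
        reduced_dictionary[to_reduced(word)].add(word)   # reduced form -> source word(s)
    return reduced_dictionary

def correct(recognized_word, reduced_dictionary, to_reduced):
    sources = reduced_dictionary.get(to_reduced(recognized_word), set())
    if len(sources) == 1:
        return next(iter(sources))     # unambiguous: the OCR spelling can be corrected
    return recognized_word             # ambiguous (e.g. Sheila vs Shelia) or unknown: keep as recognized

names = ["Sheila", "Shelia", "Mikhail"]
dictionary = build_reduced_dictionary(names, to_reduced)
print(correct("Mikhai1", dictionary, to_reduced))   # -> "Mikhail" (corrected)
print(correct("Shelia", dictionary, to_reduced))    # -> "Shelia" (left as recognized)
```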
- If the alphabet consists of hieroglyphs, such an alphabet can also be divided into several groups, and the result of recognizing a hieroglyph will be the group containing that hieroglyph, rather than the code of that hieroglyph.
- the words are entered into the dictionary using a certain narrowed alphabet.
- FIG. 6 shows exemplary hardware for implementing the techniques and systems described herein, in accordance with one implementation of the present disclosure.
- the exemplary hardware includes at least one processor 602 coupled to a memory 604 .
- the processor 602 may represent one or more processors (e.g. microprocessors), and the memory 604 may represent random access memory (RAM) devices comprising a main storage of the hardware, as well as any supplemental levels of memory, e.g., cache memories, non-volatile or back-up memories (e.g. programmable or flash memories), read-only memories, etc.
- the memory 604 may be considered to include memory storage physically located elsewhere in the hardware, e.g. any cache memory in the processor 602 as well as any storage capacity used as a virtual memory, e.g., as stored on a mass storage device 610 .
- the hardware also typically receives a number of inputs and outputs for communicating information externally.
- The hardware may include one or more user input devices 606 (e.g., a keyboard, a mouse, an imaging device, a scanner, a microphone) and one or more output devices 608 (e.g., a Liquid Crystal Display (LCD) panel, a sound playback device (speaker)).
- the hardware typically includes at least one screen device.
- the hardware may also include one or more mass storage devices 610 , e.g., a floppy or other removable disk drive, a hard disk drive, a Direct Access Storage Device (DASD), an optical drive (e.g. a Compact Disk (CD) drive, a Digital Versatile Disk (DVD) drive) and/or a tape drive, among others.
- the hardware may include an interface with one or more networks 612 (e.g., a local area network (LAN), a wide area network (WAN), a wireless network, and/or the Internet among others) to permit the communication of information with other computers coupled to the networks.
- networks 612 e.g., a local area network (LAN), a wide area network (WAN), a wireless network, and/or the Internet among others
- the hardware typically includes suitable analog and/or digital interfaces between the processor 602 and each of the components 604 , 606 , 608 , and 612 as is well known in the art.
- the hardware operates under the control of an operating system 614 , and executes various computer software applications, components, programs, objects, modules, etc. to implement the techniques described above.
- various applications, components, programs, objects, etc. collectively indicated by application software 616 in FIG. 6 , may also execute on one or more processors in another computer coupled to the hardware via a network 612 , e.g. in a distributed computing environment, whereby the processing required to implement the functions of a computer program may be allocated to multiple computers over a network.
- routines executed to implement the embodiments of the invention may be implemented as part of an operating system or a specific application, component, program, object, module or sequence of instructions referred to as a “computer program.”
- A computer program typically comprises one or more sets of instructions resident at various times in various memory and storage devices of a computer which, when read and executed by one or more processors in the computer, cause the computer to perform the operations necessary to execute elements involving the various aspects of the invention.
- The various embodiments of the invention are capable of being distributed as a program product in a variety of forms, and the invention applies equally regardless of the particular type of computer-readable media actually used to effect the distribution.
- Examples of computer-readable media include but are not limited to recordable type media such as volatile and non-volatile memory devices, floppy and other removable disks, hard disk drives, optical disks (e.g., Compact Disk Read-Only Memory (CD-ROMs), Digital Versatile Disks (DVDs), flash memory, etc.), among others.
- Another type of distribution may be implemented as Internet downloads.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Character Discrimination (AREA)
Abstract
Description
Claims (19)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/659,244 US20170323170A1 (en) | 2015-09-07 | 2017-07-25 | Method and system for data extraction from images of semi-structured documents |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
RU2015137956 | 2015-09-07 | ||
RU2015137956A RU2613846C2 (en) | 2015-09-07 | 2015-09-07 | Method and system for extracting data from images of semistructured documents |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/659,244 Continuation US20170323170A1 (en) | 2015-09-07 | 2017-07-25 | Method and system for data extraction from images of semi-structured documents |
Publications (2)
Publication Number | Publication Date |
---|---|
US20170068866A1 (en) | 2017-03-09 |
US9754176B2 (en) | 2017-09-05 |
Family
ID=58190497
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/868,683 Active US9754176B2 (en) | 2015-09-07 | 2015-09-29 | Method and system for data extraction from images of semi-structured documents |
US15/659,244 Abandoned US20170323170A1 (en) | 2015-09-07 | 2017-07-25 | Method and system for data extraction from images of semi-structured documents |
Family Applications After (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/659,244 Abandoned US20170323170A1 (en) | 2015-09-07 | 2017-07-25 | Method and system for data extraction from images of semi-structured documents |
Country Status (2)
Country | Link |
---|---|
US (2) | US9754176B2 (en) |
RU (1) | RU2613846C2 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858420A (en) * | 2019-01-24 | 2019-06-07 | 国信电子票据平台信息服务有限公司 | A kind of bill processing system and processing method |
US10778614B2 (en) | 2018-03-08 | 2020-09-15 | Andre Arzumanyan | Intelligent apparatus and method for responding to text messages |
US11166127B2 (en) | 2018-03-08 | 2021-11-02 | Andre Arzumanyan | Apparatus and method for voice call initiated texting session |
US11556919B2 (en) | 2018-03-08 | 2023-01-17 | Andre Arzumanyan | Apparatus and method for payment of a texting session order from an electronic wallet |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2659745C1 (en) * | 2017-08-28 | 2018-07-03 | Общество с ограниченной ответственностью "Аби Продакшн" | Reconstruction of the document from document image series |
JP7059586B2 (en) * | 2017-11-24 | 2022-04-26 | セイコーエプソン株式会社 | Information processing device, control method of information processing device, and program |
US11205147B1 (en) * | 2018-03-01 | 2021-12-21 | Wells Fargo Bank, N.A. | Systems and methods for vendor intelligence |
RU2695489C1 (en) * | 2018-03-23 | 2019-07-23 | Общество с ограниченной ответственностью "Аби Продакшн" | Identification of fields on an image using artificial intelligence |
US10614301B2 (en) * | 2018-04-09 | 2020-04-07 | Hand Held Products, Inc. | Methods and systems for data retrieval from an image |
CN109271980A (en) * | 2018-08-28 | 2019-01-25 | 上海萃舟智能科技有限公司 | A kind of vehicle nameplate full information recognition methods, system, terminal and medium |
US11393233B2 (en) * | 2020-06-02 | 2022-07-19 | Google Llc | System for information extraction from form-like documents |
CN112364857B (en) * | 2020-10-23 | 2024-04-26 | 中国平安人寿保险股份有限公司 | Image recognition method, device and storage medium based on numerical extraction |
US20230260307A1 (en) * | 2022-02-11 | 2023-08-17 | Black Knight Ip Holding Company, Llc | System and method for extracting data from a source |
US12292934B2 (en) * | 2022-09-07 | 2025-05-06 | Sage Global Services Limited | Classifying documents using geometric information |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2295154C1 (en) * | 2005-06-16 | 2007-03-10 | "Аби Софтвер Лтд." | Method for recognizing text information from graphic file with usage of dictionaries and additional data |
US7539343B2 (en) * | 2005-08-24 | 2009-05-26 | Hewlett-Packard Development Company, L.P. | Classifying regions defined within a digital image |
RU2309456C2 (en) * | 2005-12-08 | 2007-10-27 | "Аби Софтвер Лтд." | Method for recognizing text information in vector-raster image |
US8879846B2 (en) * | 2009-02-10 | 2014-11-04 | Kofax, Inc. | Systems, methods and computer program products for processing financial documents |
US8929640B1 (en) * | 2009-04-15 | 2015-01-06 | Cummins-Allison Corp. | Apparatus and system for imaging currency bills and financial documents and method for using the same |
WO2012120587A1 (en) * | 2011-03-04 | 2012-09-13 | グローリー株式会社 | Text string cut-out method and text string cut-out device |
- 2015-09-07: RU application RU2015137956A, patent RU2613846C2 (active)
- 2015-09-29: US application US14/868,683, patent US9754176B2 (active)
- 2017-07-25: US application US15/659,244, publication US20170323170A1 (abandoned)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5067165A (en) * | 1989-04-19 | 1991-11-19 | Ricoh Company, Ltd. | Character recognition method |
US5862259A (en) * | 1996-03-27 | 1999-01-19 | Caere Corporation | Pattern recognition employing arbitrary segmentation and compound probabilistic evaluation |
US20010019629A1 (en) * | 1997-02-12 | 2001-09-06 | Loris Navoni | Word recognition device and method |
US20090067756A1 (en) * | 2006-02-17 | 2009-03-12 | Lumex As | Method and system for verification of uncertainly recognized words in an ocr system |
US20080267505A1 (en) * | 2007-04-26 | 2008-10-30 | Xerox Corporation | Decision criteria for automated form population |
US20150356365A1 (en) * | 2014-06-09 | 2015-12-10 | I.R.I.S. | Optical character recognition method |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10778614B2 (en) | 2018-03-08 | 2020-09-15 | Andre Arzumanyan | Intelligent apparatus and method for responding to text messages |
US11166127B2 (en) | 2018-03-08 | 2021-11-02 | Andre Arzumanyan | Apparatus and method for voice call initiated texting session |
US11310174B2 (en) | 2018-03-08 | 2022-04-19 | Andre Arzumanyan | Intelligent apparatus and method for responding to text messages |
US11556919B2 (en) | 2018-03-08 | 2023-01-17 | Andre Arzumanyan | Apparatus and method for payment of a texting session order from an electronic wallet |
US11875279B2 (en) | 2018-03-08 | 2024-01-16 | Andre Arzumanyan | Method for payment of an order to be forwarded to one of a plurality of client establishments |
CN109858420A (en) * | 2019-01-24 | 2019-06-07 | 国信电子票据平台信息服务有限公司 | A kind of bill processing system and processing method |
Also Published As
Publication number | Publication date |
---|---|
US20170323170A1 (en) | 2017-11-09 |
RU2015137956A (en) | 2017-03-13 |
US20170068866A1 (en) | 2017-03-09 |
RU2613846C2 (en) | 2017-03-21 |
Legal Events
- AS (Assignment): owner ABBYY DEVELOPMENT LLC, Russian Federation; assignment of assignors interest; assignor KOSTYUKOV, MIKHAIL; reel/frame 036771/0564; effective date 2015-10-12
- STCF (Information on status: patent grant): patented case
- CC: Certificate of correction
- AS (Assignment): owner ABBYY PRODUCTION LLC, Russian Federation; merger; assignor ABBYY DEVELOPMENT LLC; reel/frame 047997/0652; effective date 2017-12-08
- MAFP (Maintenance fee payment): payment of maintenance fee, 4th year, large entity (original event code M1551); entity status of patent owner: large entity; year of fee payment: 4
- AS (Assignment): owner ABBYY DEVELOPMENT INC., North Carolina; assignment of assignors interest; assignor ABBYY PRODUCTION LLC; reel/frame 059249/0873; effective date 2021-12-31
- AS (Assignment): owner WELLS FARGO BANK, NATIONAL ASSOCIATION, AS AGENT, California; security interest; assignors ABBYY INC., ABBYY USA SOFTWARE HOUSE INC., ABBYY DEVELOPMENT INC.; reel/frame 064730/0964; effective date 2023-08-14
- FEPP (Fee payment procedure): maintenance fee reminder mailed (original event code REM.); entity status of patent owner: large entity