US10019992B2 - Speech-controlled actions based on keywords and context thereof - Google Patents


Info

Publication number
US10019992B2
Authority
US
United States
Prior art keywords
keyword
context
recognition module
speech
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US14/754,457
Other versions
US20160379633A1 (en)
Inventor
Jill Fain Lehman
Samer Al Moubayed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Disney Enterprises Inc
Original Assignee
Disney Enterprises Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Disney Enterprises Inc
Priority to US14/754,457
Assigned to DISNEY ENTERPRISES, INC. (Assignors: AL MOUBAYED, SAMER; LEHMAN, JILL FAIN)
Publication of US20160379633A1
Application granted
Publication of US10019992B2
Legal status: Active (adjusted expiration)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/088 Word spotting
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 2015/226 Procedures used during a speech recognition process using non-speech characteristics
    • G10L 2015/227 Procedures used during a speech recognition process using non-speech characteristics of the speaker; Human-factor methodology
    • G10L 2015/228 Procedures used during a speech recognition process using non-speech characteristics of application context

Definitions

  • FIG. 3 shows a flowchart illustrating exemplary speech recognition method 300 for use by device 100 , according to one implementation of the present disclosure.
  • device 100 uses microphone 105 to receive input speech 106 spoken by a user.
  • the microphone may be a peripheral device electronically connected to device 100 .
  • device 100 may use an array of microphones to determine the location from which speech originates. For example, a video game system having an array of microphones may be able to distinguish between speech coming from a player and speech coming from a person in another room.
  • device 100 uses A/D converter 115 to convert input speech 106 from an analog form to a digital form, and generates digitized speech 108 .
  • A/D converter 115 samples the analog signal at regular intervals and sends digitized speech 108 to speech recognition module 140 .
  • keyword recognition module 150 detects a keyword in input speech 106 .
  • keyword recognition module 150 is continuously listening for instances of keywords, and in other implementations, keyword recognition module 150 may include a VAD, such that keyword recognition module 150 begins listening for instances of keywords when speech is detected.
  • a keyword may be a word or series of words associated with an action.
  • keyword recognition module 150 includes keyword library 155 which may include a plurality of keywords. Each keyword of keyword library 155 may be associated with a particular action.
  • keyword recognition module 150 may pre-process digitized speech 108 , extract features from the pre-processed digitized speech, and perform computation and scoring to match extracted features of the pre-processed digitized speech with keywords in keyword library 155 .
  • keyword recognition module 150, in response to detecting a keyword in digitized speech 108, initiates a process for executing an action associated with the keyword.
  • processor 120 executes the action associated with the keyword with substantially no delay, other than the inherent delay of communicating signals within device 100 .
  • although keyword recognition module 150 initiates the process for executing an action associated with the keyword, execution of the action is delayed awaiting a determination by context recognition module 160 that the detected keyword is an instruction.
  • keyword recognition module 150 informs context recognition module 160 that digitized speech 108 includes the keyword.
  • context recognition module 160 may independently determine that digitized speech 108 includes the keyword.
  • context recognition module 160 determines a context for the keyword. Determining the context for the keyword may be based on words before and/or after the detected keyword in input speech 106 . For example, if context recognition module 160 determines the keyword to be a stand-alone word without any speech before or after the instruction keyword, the keyword is likely to be classified as an instruction. Context recognition module 160 may use a VAD to determine if there is any speech before or after the uttered keyword. In one implementation, the context may also include other sensory input, such as visual input, biometric input, or other non-verbal or verbal input. For example, context recognition module 160 may determine the keyword is less likely to be a command if the visual input indicates that the user is not facing the device.
  • context recognition module 160 may analyze the words appearing before and/or after the keyword using grammar 170, and additional factors, such as silence detection, location of the speaker, facial expression of the speaker, gesture and movements of the speaker, etc., to determine whether the keyword is more likely or less likely to be an instruction in view of the context.
  • Context recognition module 160 may also include grammar 170 for determining the context for uttered keywords based on the speech spoken before or after the uttered keywords.
  • grammar 170 may independently detect keywords in speech, and may include keywords similar to, or in addition to, those of keyword library 155.
  • Grammar 170 may contain a plurality of rules, where each rule defines a set of language constraints that context recognition module 160 uses to restrict possible word or sentence choices while determining the context for keywords that are designated as instructions or commands in context recognition module 160 .
  • Grammar 170 may include properties that can be set to optimize speech recognition module 140 for specific recognition environments and tasks.
  • grammar 170 may include properties that specify the language for grammar 170 , which grammar 170 rules to use, and the format or semantic content for grammar 170 .
  • context recognition module 160 may use a VAD for silence and voice detection to determine if there is any speech before or after the uttered keyword.
  • the context may also include other sensory input, such as visual input, biometric input, or other non-verbal or verbal input.
  • context recognition module 160 may determine the keyword is less likely to be a command if the visual input indicates that the user is not facing the device.
  • device 100 may proceed with the process of executing an action associated with the keyword if the context does not reject or does confirm execution of the action.
  • context recognition module 160 determines whether the detected keyword is more or less likely to be a command or instruction based on the context. For example, based on the context, context recognition module 160 may determine a probability, such as a 10%, 20%, or 30% chance, that the keyword is intended as an instruction; a minimal end-to-end sketch of this flow follows this list.
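The following minimal Python sketch ties these steps together: digitize the input speech, detect a keyword, and execute the associated action only if the context step estimates a sufficiently high probability that the keyword was meant as an instruction. The callable parameters and the 0.5 threshold are assumptions for illustration; the patent does not specify a threshold.

    def method_300(input_speech, a_d_converter, keyword_module, context_module,
                   execute_action, command_threshold=0.5):
        # End-to-end sketch: digitize the speech, detect a keyword, initiate its
        # action, and execute only if the context looks like a command. The 0.5
        # threshold is an arbitrary illustration; the patent only says the context
        # module may estimate a probability (e.g. 10%, 20%, 30%).
        digitized_speech = a_d_converter(input_speech)        # analog samples -> digital
        keyword = keyword_module(digitized_speech)            # e.g. "go" or None
        if keyword is None:
            return None
        p_command = context_module(digitized_speech, keyword)
        if p_command >= command_threshold:
            return execute_action(keyword)                    # action taken by component 190
        return None                                           # prevented or terminated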

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Artificial Intelligence (AREA)

Abstract

A device includes a plurality of components, a memory having a keyword recognition module and a context recognition module, a microphone configured to receive an input speech spoken by a user, an analog-to-digital converter configured to convert the input speech from an analog form to a digital form and generate a digitized speech, and a processor. The processor is configured to detect, using the keyword recognition module, a keyword in the digitized speech, initiate, in response to detecting the keyword by the keyword recognition module, an action to be taken by one of the plurality of components, wherein the keyword is associated with the action, determine, using the context recognition module, a context for the keyword, and execute the action if the context determined by the context recognition module indicates that the keyword is a command.

Description

BACKGROUND
As speech recognition technology has advanced, voice-activated devices have become more and more popular and have found new applications. Today, an increasing number of mobile phones, in-home devices, and automobile devices include speech or voice recognition capabilities. Although the speech recognition modules incorporated into such devices are trained to recognize specific keywords, they tend to be unreliable. This is because the specific keywords may appear in a spoken sentence and be incorrectly recognized as voice commands by the speech recognition module when not intended by the user. Also, in some cases, the specific keywords intended to be taken as commands may not be recognized by the speech recognition module, because the specific keywords may appear in between other spoken words, and be ignored. Both situations can frustrate the user and cause the user to give up and resort to inputting the commands manually, speaking the keywords numerous times, or turning off voice recognition.
SUMMARY
The present disclosure is directed to speech-controlled actions based on keywords and context thereof, substantially as shown in and/or described in connection with at least one of the figures, as set forth more completely in the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows a diagram of an exemplary device with speech recognition capability, according to one implementation of the present disclosure;
FIG. 2 shows an exemplary operational flow diagram for the device of FIG. 1 with speech recognition capability, according to one implementation of the present disclosure; and
FIG. 3 shows a flowchart illustrating an exemplary speech recognition method for use by the device of FIG. 1, according to one implementation of the present disclosure.
DETAILED DESCRIPTION
The following description contains specific information pertaining to implementations in the present disclosure. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover, the drawings and illustrations in the present application are generally not to scale, and are not intended to correspond to actual relative dimensions.
FIG. 1 shows a diagram of device 100 with speech recognition capability, according to one implementation of the present disclosure. Device 100 includes microphone 105, input device 107, analog-to-digital (A/D) converter 115, processor 120, memory 130, and component 190. Device 100 may be a video game system, a robot, an automated appliance, such as a radio or a kitchen appliance, or any other device or equipment that can be command-controlled. For example, device 100 may be a video game system configured to receive play instructions from a user by speech or voice commands, or an oven configured to receive operating instructions by speech or voice commands.
Device 100 uses microphone 105 to receive speech or voice commands from a user. A/D converter 115 is configured to receive an input speech or audio from microphone 105, and to convert input speech 106, which is in analog form, to digitized speech 108, which is in digital form. As shown in FIG. 1, A/D converter 115 is electronically connected to speech recognition module 140, such that A/D converter 115 can send digitized speech 108 to speech recognition module 140. Using A/D converter 115, analog audio signals or input speech 106 may be converted into digital signals or digitized speech 108 to allow speech recognition module 140 to recognize spoken words and phrases. This is typically accomplished by pre-processing digitized speech 108, extracting features from the pre-processed digitized speech, and performing computation and scoring to match extracted features of the pre-processed digitized speech with keywords.
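For illustration only, the pre-process, feature-extraction, and scoring steps described above can be sketched in Python. The sketch is not taken from the patent: the per-frame energy features, the distance-based score, and the threshold are placeholder assumptions standing in for whatever acoustic front end an actual speech recognition module 140 would use.

    from typing import Dict, List

    def pre_process(digitized_speech: List[int]) -> List[float]:
        # Placeholder pre-processing: normalize raw PCM samples to [-1.0, 1.0].
        peak = max([abs(s) for s in digitized_speech] + [1])
        return [s / peak for s in digitized_speech]

    def extract_features(samples: List[float], frame_size: int = 160) -> List[float]:
        # Hypothetical front end: per-frame energy stands in for real acoustic features.
        return [sum(x * x for x in samples[i:i + frame_size])
                for i in range(0, len(samples), frame_size)]

    def score_keyword(features: List[float], keyword_model: List[float]) -> float:
        # Toy matching score: negative mean squared distance to a stored keyword model.
        n = min(len(features), len(keyword_model))
        if n == 0:
            return float("-inf")
        return -sum((features[i] - keyword_model[i]) ** 2 for i in range(n)) / n

    def spot_keywords(digitized_speech: List[int],
                      keyword_library: Dict[str, List[float]],
                      threshold: float = -0.5) -> List[str]:
        # Return every keyword whose model matches the utterance above the threshold.
        features = extract_features(pre_process(digitized_speech))
        return [kw for kw, model in keyword_library.items()
                if score_keyword(features, model) >= threshold]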
In some implementations, input device 107 may be a non-auditory input device, such as a camera, a motion sensor, a biometric sensor, etc. For example, input device 107 may be a camera that captures images of one or more participants and the environment, which are used by an image processing module (not shown) under the control of processor 120. The information related to one or more participants and the environment may be used by context recognition module 160, under the control of processor 120, to determine the context of a specific keyword. For example, if the image processing module determines that the participant is facing away from device 100 while uttering the specific keyword, context recognition module 160 may inform processing module 180 that the specific keyword should not be considered a command.
In one implementation, input device 107 may be a motion sensor, which can sense movements of one or more participants using a motion sensing module (not shown) under the control of processor 120. The information related to motions of one or more participants may be used by context recognition module 160, under the control of processor 120, to determine whether a specific keyword recognized by keyword recognition module 150 should be executed by processing module 180 as a command. In yet another implementation, input device 107 may be another microphone, which can be used by context recognition module 160, under the control of processor 120, to extract additional features from the speech signal, such as pitch, prosodic contour, etc., so that context recognition module 160 may use such additional features to determine whether a detected voice command should be executed as a command by processing module 180 or not. For example, a change in pitch of the uttered word or a change in volume at which the word is uttered may also be considered by context recognition module 160. In one implementation, digitized speech 108 may be used to extract such additional features for use by context recognition module 160.
Processor 120 may be configured to access memory 130 to store received input or to execute commands, processes, or programs stored in memory 130, such as keyword recognition module 150, context recognition module 160, and processing module 180. Processor 120 may correspond to a central processing unit, such as a microprocessor or similar hardware processing device, or a plurality of hardware devices. Although FIG. 1 shows a single processor, namely processor 120, in other implementations, keyword recognition module 150, context recognition module 160, and processing module 180 may be executed by different processors. Memory 130 is a non-transitory storage device capable of storing commands, processes, data and programs for execution by processor 120. In one implementation, at least some programs and data may be stored in a cloud-based storage device. For example, in one implementation, digitized speech 108 may be transmitted over a communication network, such as the Internet, to a server including speech recognition module 140 for processing digitized speech 108, and returning the result to device 100 for controlling component 190. Speech recognition module 140 includes keyword recognition module 150, context recognition module 160, and may optionally include processing module 180. Keyword recognition module 150 may include keyword library 155, and context recognition module 160 may include grammar 170.
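As a rough illustration of the cloud-based option, the following Python sketch posts digitized speech to a remote recognizer and reads back its decision. The endpoint URL, the JSON payload, and the reply format are all assumptions made for the example; the patent only states that digitized speech 108 may be sent over a network to a server hosting speech recognition module 140.

    import json
    import urllib.request

    # Hypothetical endpoint and JSON contract; the patent only says digitized speech
    # 108 may be sent over a network to a server hosting speech recognition module 140.
    SPEECH_SERVER_URL = "https://example.com/speech-recognition"

    def recognize_remotely(digitized_speech):
        # Send digitized samples to the remote recognizer and return its decision,
        # e.g. {"keyword": "go", "is_command": True} (assumed reply format).
        payload = json.dumps({"samples": digitized_speech, "rate_hz": 16000}).encode("utf-8")
        request = urllib.request.Request(SPEECH_SERVER_URL, data=payload,
                                         headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(request) as response:
            return json.load(response)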
Keyword recognition module 150 and context recognition module 160 are software algorithms for recognizing speech. Speech recognition module 140 may include different aspects of speech recognition. Keyword recognition module 150 is adapted to recognize utterances of specific keywords and context recognition module 160 is adapted to recognize the context in which a keyword is uttered. In some implementations, context recognition module 160 may determine whether or not a specific keyword recognized by keyword recognition module 150 should be considered a command or an instruction in view of the specific keyword's context as recognized by context recognition module 160. In one implementation, keyword recognition module 150 recognizes a specific keyword, and context recognition module 160, running in parallel with keyword recognition module 150, is able to recognize the context in which the specific keyword has appeared, so as to confirm the intended purpose of the specific keyword. In such an implementation, context recognition module 160 may be an independent speech recognizer, which also recognizes the specific keyword. In other implementations, context recognition module 160 may run partially in parallel with keyword recognition module 150, such that keyword recognition module 150 provides speech recognition result 157, such as the detected keyword and information related thereto, to context recognition module 160 to determine the context of the detected keyword.
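The parallel operation of the two modules, with keyword recognition module 150 handing speech recognition result 157 to context recognition module 160, might be arranged as in the sketch below. The thread-and-queue structure and the callable names are illustrative assumptions, not the patent's implementation.

    import queue
    import threading

    def run_modules_in_parallel(digitized_speech, keyword_module, context_module):
        # Run keyword and context recognition concurrently; the keyword module's
        # output (speech recognition result 157) is handed to the context module.
        result_157 = queue.Queue()
        decisions = {}

        def keyword_worker():
            keyword = keyword_module(digitized_speech)       # e.g. "go" or None
            decisions["keyword"] = keyword
            result_157.put(keyword)

        def context_worker():
            keyword = result_157.get()                        # wait for the detection
            decisions["is_command"] = context_module(digitized_speech, keyword)

        threads = [threading.Thread(target=keyword_worker),
                   threading.Thread(target=context_worker)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return decisions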
For example, the word “go” may be a command intended to cause a video game to progress or a robot to move forward, or the word “go” may be part of a conversation about a trip to the store. A player of a video game whose avatar is perched on a cliff may desire the video game to distinguish between an utterance of the word “go” in a conversation about an intention to “go” to the store and an utterance of the word “go” intended to progress the video game, or an utterance by another player saying “no, don't go yet.” To satisfy this need, speech recognition module 140 includes keyword recognition module 150 to detect a word or phrase that is defined as a command, and context recognition module 160 to analyze the context of that command, such as the command “go.” As such, although keyword recognition module 150 detects the keyword or command “go” in the player's speech, context recognition module 160 detects the context of the word “go,” which may appear in isolation (i.e., as a command) or in the player's non-command speech, such as “I will go to the store after I am done playing this video game.”
Keyword recognition module 150 may be configured to recognize keywords. As shown in FIG. 1, keyword recognition module 150 includes keyword library 155, where keyword library 155 includes a plurality of keywords. Keyword recognition module 150 may be configured to detect keywords corresponding to any of the plurality of keywords in keyword library 155. Keyword library 155 may include words, or combinations of words. In some implementations, keyword library 155 may include English words, English phrases, non-English words, non-English phrases, or words and phrases in a plurality of languages.
A keyword may be a single word, a series of words, an instruction, a command, or a combination of words. A keyword may include commands or instructions, such as “go” or “jump,” and may be instructions to direct a character or avatar in a video game. In some implementations, keywords may include commands or instructions to control or program an appliance, such as “preheat oven to 350°” or “tune radio to 106.7 FM,” or “turn oven on at 5:30 pm, preheat to 350°.”
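A keyword library of this kind can be as simple as a mapping from keywords or phrases to the actions they trigger, as in the illustrative sketch below; the specific entries and the print-statement actions are placeholders, not the patent's API.

    # Illustrative keyword library: each entry maps a keyword or phrase to the
    # action it triggers. The entries and print-statement actions are placeholders.
    KEYWORD_LIBRARY = {
        "go": lambda: print("avatar moves forward"),
        "jump": lambda: print("avatar jumps"),
        "preheat oven to 350": lambda: print("heating element on, target 350 degrees"),
        "tune radio to 106.7 FM": lambda: print("radio tuned to 106.7 FM"),
    }

    def lookup_action(detected_keyword):
        # A real keyword recognition module would normalize case and wording;
        # here the detected keyword must match a library entry exactly.
        return KEYWORD_LIBRARY.get(detected_keyword)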
Some devices utilizing speech recognition require a quick response when a keyword is spoken. For example, device 100 may be used to support language-based interaction between a child and a robot cooperatively playing a fast-paced video game in real-time. While playing the video game, the time between a user speaking a keyword and the implementation of an action associated with the keyword should be minimized. In some implementations, a video game may have obstacles or opponents that move across the screen towards a player's avatar, and if the obstacle or opponent contacts the character or avatar, a negative consequence may occur, such as a loss of health or death of the character in the video game. Accordingly, the user may desire a video game system that reacts quickly when the user utters a keyword intended as an instruction. In some implementations, keyword recognition module 150 may be continuously listening for keywords found in keyword library 155, and when keyword recognition module 150 detects a keyword, keyword recognition module 150 may initiate a process for executing an action associated with the detected keyword. In some implementations, keyword recognition module 150 may always be listening, even if device 100 is not actively in use. For example, a smart oven may be always on, such that a user is able to simply speak the instruction “preheat oven to 350°” to initiate preheating of the oven without first manually interacting with the oven to activate speech recognition module 140. In some implementations, keyword recognition module 150 may be continuously listening only while device 100 is in use. In some implementations, context recognition module 160 may be configured to begin listening when the speech input signal is received from the microphone.
Context recognition module 160 may be configured to analyze the context of a keyword. Such an analysis may be useful to distinguish between the video game instruction “go” and conversation about an intention to “go” to the store. The context of a keyword may include words before or after the keyword, or an absence of words before or after the keyword. To this end, context recognition module 160 may also include a voice activity detector (VAD) for detecting silence when determining the context of detected keywords. The context of a keyword is not limited to the words spoken before and after the keyword, but context may also include additional information, such as biometric data, physical gestures, and body language, which may be determined using input device 107. Additionally, context may include the location of the person speaking the keyword, which may be determined by input device 107 using proximity sensors. For example, the keyword “go” spoken by a person playing a video game system in a living room has a different context than the keyword “go” spoken by a person standing in a garage.
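One possible way to combine these contextual cues (silence around the keyword as detected by a VAD, whether the speaker faces the device, and the speaker's location) is a simple additive score, sketched below; the weights are arbitrary and are not specified by the patent.

    def likely_a_command(words_before, words_after, facing_device=True, in_play_area=True):
        # Crude context score: a keyword uttered in isolation, by a user facing the
        # device from the expected location, is more likely to be a command.
        # The weights are arbitrary illustration, not values from the patent.
        score = 0.5
        if not words_before and not words_after:     # VAD found silence around the keyword
            score += 0.3
        else:
            score -= 0.2
        score += 0.1 if facing_device else -0.2      # camera-derived cue
        score += 0.1 if in_play_area else -0.2       # proximity-sensor cue
        return max(0.0, min(1.0, score))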
In some implementations, context recognition module 160 includes grammar 170. Grammar 170 may contain a plurality of rules, and each rule may define a set of constraints that context recognition module 160 uses to restrict the possible word or sentence choices while analyzing the context of a detected keyword. Grammar 170 may include properties that are specific to the grammar, such as locale, semantic format, and mode. Grammar 170 includes properties that can be set to optimize speech recognition module 140 for specific recognition environments and tasks. For example, grammar 170 may include properties that specify the language that grammar 170 contains, which of the rules of grammar 170 to apply, and the format for semantic content of grammar 170.
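A grammar of this kind might be represented as a small set of properties plus rules that constrain when a keyword should be read as conversational rather than as a command. The sketch below is an assumption for illustration: it models rules as regular expressions, which is only one possible realization of the constraints described above.

    import re
    from dataclasses import dataclass, field

    @dataclass
    class Grammar:
        # Toy stand-in for grammar 170: a few properties plus rules, here modeled
        # as regular expressions that mark a keyword-in-context as conversational.
        locale: str = "en-US"
        semantic_format: str = "literal"
        mode: str = "voice"
        non_command_rules: list = field(default_factory=lambda: [
            r"\b(want|going|about|have)\s+to\s+go\b",   # "I want to go to the store"
            r"\bdon'?t\s+go\b",                          # "no, don't go yet"
        ])

        def rejects(self, utterance: str) -> bool:
            # True if any rule restricts "go" in this utterance to a conversational reading.
            return any(re.search(rule, utterance.lower()) for rule in self.non_command_rules)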
Processing module 180 may optionally be included as a distinct component of speech recognition module 140, or may alternatively be included in keyword recognition module 150 or context recognition module 160. Processing module 180 may determine when to initiate a process for executing an action associated with a detected keyword in keyword library 155. Processing module 180 may act as a gatekeeper to determine whether to execute the action associated with the keyword. In some implementations, processing module 180 receives input from keyword recognition module 150 and context recognition module 160, and processing module 180 determines when to proceed with a process for an action associated with the keyword.
When keyword recognition module 150 detects a keyword in digitized speech 108, keyword recognition module 150 may initiate a process for executing an action associated with the detected keyword. At the same time, and in parallel with keyword recognition module 150, context recognition module 160 may detect the keyword in digitized speech 108. When context recognition module 160 detects a keyword, context recognition module 160 may analyze the context of the detected keyword. Based on the context determined by context recognition module 160, processing module 180 may determine that the detected keyword is not an instruction, but instead part of a social conversation. In this situation, processing module 180, acting as a gatekeeper, may prevent keyword recognition module 150 from initiating a process for executing a command, terminate the process if already initiated, or allow the action associated with the keyword to be executed. In some implementations, the action associated with the keyword may include sending output signal 188 to component 190, which may be a display, a robot arm, a heating part of an oven, etc.
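The gatekeeping decision described above, allowing the action, preventing it from starting, or terminating it once started, reduces to a small piece of logic such as the following sketch (the function and argument names are illustrative, not the patent's).

    def gatekeeper(keyword_detected, action_started, context_says_command):
        # Processing-module-style decision: allow the action, prevent it from starting,
        # or terminate it if it has already been initiated. Purely illustrative logic.
        if not keyword_detected:
            return "no action"
        if context_says_command:
            return "execute action"      # e.g. send output signal 188 to component 190
        return "terminate action" if action_started else "prevent action"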
In some implementations, device 100 may receive additional input while analyzing the context of the detected keyword. Additional input may include words spoken by other individuals in the vicinity of device 100. In some implementations, additional input may include sensory input, visual input, biometric input, or other non-verbal input. In some implementations, device 100 may include input device 107 to receive additional input. For example, input device 107 may include a camera to receive additional input related to a user's body position, or a motion detector to detect a user's motion, such as a physical gesture, when keyword recognition module 150 or context recognition module 160 detects a keyword. In some implementations, context recognition module 160 may receive this additional input to assist in determining the context of the keyword and analyze the context of the keyword based on this additional input.
Component 190 may be a visual output, an audio output, a signal, or a functional, mechanical or moving element of device 100 that may be instantiated by execution of the action associated with the keyword. In some implementations, component 190 may be a display, such as a computer monitor, a television, the display of a tablet computer, the display of a mobile telephone, or any display known in the art. In some implementations, component 190 may be a speaker, such as a speaker in a home stereo, a car stereo, in headphones, in a device with a display as above, or any device having a speaker. In some implementations, component 190 may be a functional component of device 100, such as a heating element of an oven, an electric motor of a fan, a motor of an automatic door, or other device that may be found in a smart home. In some implementations, component 190 may comprise an individual component or a plurality of components.
FIG. 2 shows an exemplary operational flow diagram for device 100, according to one implementation of the present disclosure. Flow diagram 200 depicts three distinct scenarios. The operational flow diagram begins with the user speaking or uttering one or more words, where 201 a/201 b/201 c show the spoken word(s), and two outgoing arrows depict the parallel processes of keyword recognition module 150 and context recognition module 160. At 202 a/202 b/202 c, keyword recognition module 150 and context recognition module 160 process digitized speech 108 received from A/D converter 115. At 203 a/203 b/203 c, keyword recognition module 150 detects keywords and context recognition module 160 detects a context for the detected keywords. In some implementations, context recognition module 160 may receive speech recognition result 157 from keyword recognition module 150. In other words, context recognition module 160 may receive information from keyword recognition module 150 that a keyword has been detected. At 204 a/204 b/204 c, processing module 180 determines whether to proceed with the execution of an action associated with detected keyword(s).
More specifically, at 201 a, the user utters the word “go” intended as a command. At 202 a, keyword recognition module 250 a detects the keyword “go” and initiates a process for executing an action associated with the keyword “go.” In one implementation, context recognition module 260 a may receive an indication from keyword recognition module 250 a that the keyword “go” has been detected, and analyze the context of the keyword. In another implementation, context recognition module 260 a may itself detect the keyword “go” and analyze the context of the keyword. At 203 a, keyword recognition module 250 a sends a signal to processing module 180 to initiate an action associated with the keyword “go.” Also, context recognition module 260 a analyzes a context of the keyword and determines, based on the context of the keyword, that the keyword is more likely an instruction. As a result, context recognition module 260 a sends a signal to processing module 180 to proceed with executing the action. At 204 a, processing module 180 proceeds with the action associated with the keyword “go” to actuate component 190.
In one implementation, context recognition module 260 a may use speech recognition algorithms to determine the context for the detected keyword based on one or more words uttered before and/or after the keyword. In another implementation, context recognition module 260 a may determine the context for the detected keyword based on non-verbal indicators alone, such as the location of the speaker, body language of the speaker, etc. In other implementations, context recognition module 260 a may use a combination of verbal and non-verbal inputs to determine the context for the detected keyword.
Turning to another example, at 201 b, the user utters the keyword "go," and may also utter a few other words before and/or after the keyword, where those other words may be used to determine a context for the keyword when calculating the probability of the keyword being an instruction. In some implementations, other indicators may be used, by themselves or in addition to contextual speech, to determine a context for the keyword for deciding whether the keyword should be considered an instruction or not, including non-verbal indicators such as the location of the speaker, body language of the speaker, etc. At 202 b, keyword recognition module 250 b detects the keyword "go" and initiates a process for executing an action associated with the keyword "go." Context recognition module 260 b may receive an indication from keyword recognition module 250 b regarding the detected keyword or, in another implementation, may itself detect the keyword "go," and further analyzes the context of the keyword to determine whether or not the detected keyword should be classified as a command or instruction. At 203 b, keyword recognition module 250 b continues the process for executing the action associated with the keyword "go," and context recognition module 260 b determines that the detected keyword is most likely not an instruction, based on the words spoken before and after the word "go," and/or secondary input(s), such as pitch and intensity of speech, motion sensing, facial expression of the player, etc.
In response to receiving the inputs from keyword recognition module 250 b and context recognition module 260 b, at 204 b, processing module 180 terminates the action associated with the keyword “go.” In some implementations, termination of the action associated with the keyword may include ending the process initiated to execute the action associated with the keyword. For example, if the action were to preheat an oven to 350°, the oven may begin preheating and then the process of preheating may be terminated by turning off the heating element. In some implementations, termination of the action associated with the keyword may occur after execution of the action has occurred or begun to occur, and thus termination may include executing an action negating the action associated with the keyword. For example, if the action were to open a door, processing module 180 may terminate the action associated with the keyword “go” (terminate opening the door), and an action closing the door may be executed, thereby negating the initial action of beginning to open the door.
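A brief sketch of the terminate-or-negate behavior described above is shown below. The component classes and method names are hypothetical; the disclosure does not prescribe a particular interface for components such as an oven or a door.

```python
# Sketch of terminate-or-negate handling, with hypothetical component interfaces.
class OvenComponent:
    def __init__(self):
        self.heating = False

    def start_preheat(self):
        self.heating = True
        print("preheating started")

    def stop_preheat(self):
        self.heating = False
        print("heating element turned off")


class DoorComponent:
    def __init__(self):
        self.position = "closed"

    def open(self):
        self.position = "opening"
        print("door opening")

    def close(self):                      # negating action
        self.position = "closed"
        print("door closed again")


def terminate(component):
    """Terminate an in-progress action, or negate it if it has already begun."""
    if isinstance(component, OvenComponent) and component.heating:
        component.stop_preheat()
    elif isinstance(component, DoorComponent) and component.position == "opening":
        component.close()


oven = OvenComponent(); oven.start_preheat(); terminate(oven)
door = DoorComponent(); door.open(); terminate(door)
```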
In another example, at 201 c, the user utters the keyword "go." At 202 c, keyword recognition module 250 c detects the keyword "go," and initiates a process for an action associated with the keyword "go" to be taken by component 190. Context recognition module 260 c, operating along with keyword recognition module 250 c, may also detect the keyword "go," or may receive an indication from keyword recognition module 250 c that the keyword "go" has been detected. In response, context recognition module 260 c determines the context of the keyword. As explained above, the context for the detected keyword may be determined using only one or a combination of inputs or factors. For example, a VAD may be used to determine whether there is additional speech before and/or after the detected keyword. In one implementation, detection of additional speech may indicate a lower likelihood of the detected keyword being an instruction. In another example, grammar 170 of context recognition module 260 c may analyze additional speech before and/or after the detected keyword to determine the likelihood of the detected keyword being an instruction, for example, by considering whether the keyword "go" appears in "go for it" or in "I want to go to the store now." Context recognition module 260 c may also analyze non-verbal indicators such as the location of the speaker, body language of the speaker, facial expression, movement, etc.
In the example of 203 c, keyword recognition module 250 c continues the process for executing the action associated with the keyword “go,” and context recognition module 260 c determines, based on the context of the keyword, that the keyword is more likely not an instruction. As such, at 204 c, processing module 180 may terminate the process for executing the action associated with the keyword “go” before execution of the action by component 190 has begun.
FIG. 3 shows a flowchart illustrating exemplary speech recognition method 300 for use by device 100, according to one implementation of the present disclosure. As shown in FIG. 3, at 311, device 100 uses microphone 105 to receive input speech 106 spoken by a user. In one implementation, the microphone may be a peripheral device electronically connected to device 100. In some implementations, device 100 may use an array of microphones to determine the location from which speech originates. For example, a video game system having an array of microphones may be able to distinguish between speech coming from a player and speech coming from a person in another room.
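One common way a microphone array can distinguish where speech originates is by estimating the time difference of arrival between microphones; the sketch below uses a standard cross-correlation for this purpose. This is a generic technique offered for illustration, not a detail taken from the disclosure.

```python
# Sketch of time-difference-of-arrival estimation between two microphones using
# cross-correlation (a standard technique, not a detail from the disclosure).
import numpy as np


def lag_of_b_behind_a(mic_a, mic_b):
    """Samples by which mic_b lags mic_a (positive: sound reached mic_a first)."""
    corr = np.correlate(mic_b, mic_a, mode="full")
    return int(np.argmax(corr) - (len(mic_a) - 1))


fs = 16000
signal = np.random.randn(fs)                              # one second of synthetic "speech"
delay = 5                                                 # sound reaches mic_b 5 samples later
mic_a = signal
mic_b = np.concatenate([np.zeros(delay), signal[:-delay]])

print(lag_of_b_behind_a(mic_a, mic_b))                    # ~5: speaker is closer to mic_a
```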
At 312, device 100 uses A/D converter 115 to convert input speech 106 from an analog form to a digital form, and generates digitized speech 108. To convert the signal from analog to digital form, the A/D converter samples the analog signal at regular intervals and sends digitized speech 108 to speech recognition module 140.
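Conceptually, the A/D conversion amounts to sampling the analog waveform at a fixed rate and quantizing each sample. The sketch below assumes a 16 kHz sample rate and 16-bit samples; neither value is mandated by the disclosure.

```python
# Sketch of A/D conversion: sample an "analog" waveform at regular intervals and
# quantize to 16-bit integers (16 kHz and 16-bit are assumed, illustrative values).
import numpy as np

SAMPLE_RATE = 16000  # samples per second


def digitize(analog_signal, duration_s):
    """Sample a continuous-time function at regular intervals and quantize to int16."""
    t = np.arange(0, duration_s, 1.0 / SAMPLE_RATE)
    samples = analog_signal(t)                            # measurements of the analog input
    return np.clip(samples * 32767, -32768, 32767).astype(np.int16)


# A 440 Hz tone stands in for the analog input speech.
digitized_speech = digitize(lambda t: 0.5 * np.sin(2 * np.pi * 440 * t), duration_s=0.01)
print(len(digitized_speech), digitized_speech[:5])
```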
At 313, keyword recognition module 150 detects a keyword in input speech 106. In some implementations, keyword recognition module 150 is continuously listening for instances of keywords, and in other implementations, keyword recognition module 150 may include a VAD, such that keyword recognition module 150 begins listening for instances of keywords when speech is detected. A keyword may be a word or series of words associated with an action. In some implementations, keyword recognition module 150 includes keyword library 155, which may include a plurality of keywords. Each keyword of keyword library 155 may be associated with a particular action. To detect a keyword in digitized speech 108, keyword recognition module 150 may pre-process digitized speech 108, extract features from the pre-processed digitized speech, and perform computation and scoring to match extracted features of the pre-processed digitized speech with keywords in keyword library 155.
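The three stages named above (pre-processing, feature extraction, and scoring against a keyword library) might be arranged as in the following sketch. Real keyword spotters operate on acoustic features such as MFCCs; plain text tokens and a toy scoring rule stand in here so the example stays self-contained, and the keyword library contents are hypothetical.

```python
# Sketch of the pre-process / extract-features / score pipeline with a toy,
# text-based stand-in for acoustic feature matching.
KEYWORD_LIBRARY = {"go": "move_character", "stop": "halt_character"}   # hypothetical


def pre_process(digitized_speech):
    return digitized_speech.lower().strip()


def extract_features(pre_processed):
    return pre_processed.split()


def score_keywords(features):
    """Score each library keyword; here, 1.0 if the token appears, else 0.0."""
    return {kw: float(kw in features) for kw in KEYWORD_LIBRARY}


def detect_keyword(digitized_speech, threshold=0.5):
    scores = score_keywords(extract_features(pre_process(digitized_speech)))
    best = max(scores, key=scores.get)
    return (best, KEYWORD_LIBRARY[best]) if scores[best] >= threshold else None


print(detect_keyword("Go"))                    # ('go', 'move_character')
print(detect_keyword("I like this game"))      # None
```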
At 314, keyword recognition module 150, in response to detecting a keyword in digitized speech 108, initiates a process for executing an action associated with the keyword. In some implementations, once keyword recognition module 150 initiates the process, processor 120 executes the action associated with the keyword with substantially no delay, other than the inherent delay of communicating signals within device 100. However, in some implementations, when keyword recognition module 150 initiates the process for executing an action associated with the keyword, execution of the action is delayed awaiting a determination by context recognition module 160 that the detected keyword is an instruction.
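The "initiate now, execute only after the context verdict" behavior could be modeled with a future as the hand-off between the two recognizers, as in the sketch below. The names and the use of a future are assumptions made for illustration.

```python
# Sketch of delayed execution: the action process is initiated immediately but
# execution waits on the context recognizer's verdict (modeled as a Future).
from concurrent.futures import Future


def initiate_action(action_name, context_decision):
    """Prepare the action, then block until the context recognizer decides."""
    print("process for '%s' initiated" % action_name)
    if context_decision.result(timeout=2.0):        # wait for the context verdict
        print("'%s' executed" % action_name)
    else:
        print("'%s' terminated before execution" % action_name)


decision = Future()
decision.set_result(True)         # context module confirmed the keyword is a command
initiate_action("open_door", decision)
```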
At 315, keyword recognition module 150 informs context recognition module 160 that digitized speech 108 includes the keyword. In some implementations, context recognition module 160 may independently determine that digitized speech 108 includes the keyword.
At 316, context recognition module 160 determines a context for the keyword. Determining the context for the keyword may be based on words before and/or after the detected keyword in input speech 106. For example, if context recognition module 160 determines the keyword to be a stand-alone word without any speech before or after the keyword, the keyword is likely to be classified as an instruction. Context recognition module 160 may use a VAD to determine if there is any speech before or after the uttered keyword. In one implementation, the context may also include other sensory input, such as visual input, biometric input, or other non-verbal or verbal input. For example, context recognition module 160 may determine the keyword is less likely to be a command if the visual input indicates that the user is not facing the device. Further, context recognition module 160 may analyze the words appearing before and/or after the keyword using grammar 170, and additional factors, such as silence detection, location of the speaker, facial expression of the speaker, gestures and movements of the speaker, etc., to determine whether the keyword is more likely or less likely to be an instruction in view of the context.
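A toy version of the stand-alone check is sketched below: a simple energy-based VAD looks for speech in the frames immediately before and after the keyword. The frame size and energy threshold are illustrative assumptions.

```python
# Sketch of a stand-alone-keyword check using a crude energy-based VAD
# (threshold and frame size are assumed, illustrative values).
import numpy as np


def has_speech(frames, energy_threshold=0.01):
    """Crude VAD: any frame whose mean energy exceeds the threshold counts as speech."""
    return any(np.mean(frame ** 2) > energy_threshold for frame in frames)


def standalone_keyword(frames_before, frames_after):
    """True if the keyword is surrounded by silence, i.e. more likely an instruction."""
    return not has_speech(frames_before) and not has_speech(frames_after)


silence = np.zeros((5, 160))                    # 5 silent frames of 160 samples
speech = 0.2 * np.random.randn(5, 160)          # 5 frames with speech-like energy
print(standalone_keyword(silence, silence))     # True  -> likely a command
print(standalone_keyword(speech, silence))      # False -> likely conversation
```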
Context recognition module 160 may also include grammar 170 for determining the context for uttered keywords based on the speech spoken before or after the uttered keywords. In one implementation, grammar 170 may independently detect keywords in speech, and may include keywords similar to, or in addition to, those of keyword library 155. Grammar 170 may contain a plurality of rules, where each rule defines a set of language constraints that context recognition module 160 uses to restrict possible word or sentence choices while determining the context for keywords that are designated as instructions or commands in context recognition module 160. Grammar 170 may include properties that can be set to optimize speech recognition module 140 for specific recognition environments and tasks. For example, grammar 170 may include properties that specify the language for grammar 170, which of its rules to use, and the format or semantic content for grammar 170.
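Grammar 170 is described only in terms of rules and properties, so one plausible (but purely assumed) representation is a list of rules that inspect the words around a keyword and adjust a command likelihood, as sketched below.

```python
# Sketch of a rule-based grammar: each rule looks at the words around the keyword
# and adjusts the "is a command" likelihood. Patterns and weights are assumptions.
GRAMMAR_RULES = [
    # (pattern around the keyword, adjustment to the command likelihood)
    (lambda before, after: before[-2:] == ["want", "to"], -0.4),   # "I want to go ..."
    (lambda before, after: after[:2] == ["for", "it"],    -0.3),   # "go for it"
    (lambda before, after: not before and not after,      +0.4),   # bare "go"
]


def apply_grammar(before, after, base=0.5):
    score = base
    for rule, adjustment in GRAMMAR_RULES:
        if rule(before, after):
            score += adjustment
    return max(0.0, min(1.0, score))


print(apply_grammar([], []))                               # 0.9: likely a command
print(apply_grammar(["i", "want", "to"], ["to", "the"]))   # 0.1: likely conversation
```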
Further, context recognition module 160 may use a VAD for silence and voice detection to determine if there is any speech before or after the uttered keyword. In one implementation, the context may also include other sensory input, such as visual input, biometric input, or other non-verbal or verbal input. For example, context recognition module 160 may determine the keyword is less likely to be a command if the visual input indicates that the user is not facing the device.
At 317, device 100 may proceed with the process of executing an action associated with the keyword if the context does not reject, or affirmatively confirms, execution of the action. In some implementations, context recognition module 160 determines whether the detected keyword is less likely or more likely to be a command or instruction based on the context. For example, based on the context, context recognition module 160 may determine a probability that the keyword is intended as an instruction, such as a 10%, 20%, or 30% chance that the keyword is intended as an instruction.
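The final decision can be reduced to comparing that probability against a threshold, as in the short sketch below; the 0.5 cut-off is an assumption, since the disclosure leaves the decision policy open.

```python
# Sketch of the final decision step: compare the context module's probability
# that the keyword was intended as an instruction against an assumed threshold.
def decide(probability_is_instruction, threshold=0.5):
    if probability_is_instruction >= threshold:
        return "proceed with action"
    return "terminate action"


for p in (0.1, 0.3, 0.8):
    print(p, "->", decide(p))
```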
From the above description, it is manifest that various techniques can be used for implementing the concepts described in the present application without departing from the scope of those concepts. Moreover, while the concepts have been described with specific reference to certain implementations, a person of ordinary skill in the art would recognize that changes can be made in form and detail without departing from the scope of those concepts. As such, the described implementations are to be considered in all respects as illustrative and not restrictive. It should also be understood that the present application is not limited to the particular implementations described above, but many rearrangements, modifications, and substitutions are possible without departing from the scope of the present disclosure.

Claims (20)

What is claimed is:
1. A device comprising:
a plurality of components;
a memory including a keyword recognition module and a context recognition module;
a microphone configured to receive an input speech spoken by a user;
an analog-to-digital converter configured to convert the input speech from an analog form to a digital form and generate a digitized speech;
a processor configured to:
detect, using the keyword recognition module, a keyword in the digitized speech based on features extracted from the digitized speech;
initiate, in response to detecting the keyword by the keyword recognition module, an action to be taken by one of the plurality of components, wherein the keyword is defined as a command to take the action;
determine, using the context recognition module and after detecting the keyword, a context for the keyword based on additional features extracted from a portion of the digitized speech before and/or after the keyword, wherein the context is used to determine whether or not the keyword should be considered to be the command; and
execute the action if the context determined by the context recognition module indicates that the keyword should be considered to be the command.
2. The device of claim 1, wherein the context recognition module utilizes a voice activity detector to determine the context.
3. The device of claim 1, wherein the keyword recognition module is continuously listening for the keyword and the context recognition module is configured to begin listening when the input speech is received from the microphone.
4. The device of claim 1, wherein the processor is configured to:
prior to determining the context of the keyword, receive one or more second inputs from the user; and
analyze the context of the keyword based on the one or more second inputs.
5. The device of claim 4, wherein the one or more second inputs is a non-verbal input including a physical gesture.
6. The device of claim 4, wherein the one or more second inputs are received from one of a motion sensor and a video camera.
7. The device of claim 1, wherein the context recognition module determines that the keyword is in the digitized speech based on an indication received from the keyword recognition module.
8. The device of claim 1, wherein the context of the keyword includes a location of the user.
9. The device of claim 1, wherein the processor is further configured to:
display a result of executing the action on a display.
10. The device of claim 1, wherein the processor is further configured to:
terminate the action if the context determined by the context recognition module indicates that the keyword should not be considered to be the command.
11. A method for speech recognition by a device having a microphone, a processor, and a memory including a keyword recognition module and a context recognition module, the method comprising:
detecting, using the keyword recognition module, a keyword in a digitized speech based on features extracted from the digitized speech;
initiating, in response to detecting the keyword by the keyword recognition module, an action to be taken by one of the plurality of components, wherein the keyword is defined as a command to take the action;
determining, using the context recognition module and after detecting the keyword, a context for the keyword based on additional features extracted from a portion of the digitized speech before and/or after the keyword, wherein the context is used to determine whether or not the keyword should be considered to be the command; and
executing the action if the context determined by the context recognition module indicates that the keyword should be considered to be the command.
12. The method of claim 11, wherein the context recognition module utilizes a voice activity detector to determine the context.
13. The method of claim 11, wherein the keyword recognition module is continuously listening for the keyword and the context recognition module is configured to begin listening when the input speech is received from the microphone.
14. The method of claim 11, further comprising:
prior to determining the context of the keyword, receiving one or more second inputs from the user; and
analyzing the context of the keyword based on the one or more second inputs.
15. The method of claim 14, wherein the one or more second inputs include a non-verbal input including a physical gesture.
16. The method of claim 14, wherein the one or more second inputs are received from one of a motion sensor and a video camera.
17. The method of claim 11, wherein the context recognition module determines that the keyword is in the digitized speech based on an indication received from the keyword recognition module.
18. The method of claim 11, wherein the context of the keyword includes a location of a user.
19. The method of claim 11 further comprising:
terminating the action if the context determined by the context recognition module indicates that the keyword should not be considered to be the command.
20. A device comprising:
a plurality of components;
a memory including a keyword recognition module and a context recognition module;
a microphone configured to receive an input speech spoken by a user;
an analog-to-digital converter configured to convert the input speech from an analog form to a digital form and generate a digitized speech;
a processor configured to:
detect, using the keyword recognition module, a keyword in the digitized speech based on features extracted from the digitized speech;
initiate, in response to detecting the keyword by the keyword recognition module, an action to be taken by one of the plurality of components, wherein the keyword is defined as a command to take the action;
determine, using the context recognition module and after detecting the keyword, a context for the keyword based on additional features extracted from the digitized speech before and after the keyword, wherein the context is used to determine whether or not the keyword should be considered to be the command;
execute the action if the context determined by the context recognition module indicates that the keyword should be considered to be the command; and
terminate the action if the context determined by the context recognition module indicates that the keyword should not be considered to be the command.
US14/754,457 2015-06-29 2015-06-29 Speech-controlled actions based on keywords and context thereof Active 2035-12-14 US10019992B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/754,457 US10019992B2 (en) 2015-06-29 2015-06-29 Speech-controlled actions based on keywords and context thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US14/754,457 US10019992B2 (en) 2015-06-29 2015-06-29 Speech-controlled actions based on keywords and context thereof

Publications (2)

Publication Number Publication Date
US20160379633A1 US20160379633A1 (en) 2016-12-29
US10019992B2 true US10019992B2 (en) 2018-07-10

Family

ID=57602711

Family Applications (1)

Application Number Title Priority Date Filing Date
US14/754,457 Active 2035-12-14 US10019992B2 (en) 2015-06-29 2015-06-29 Speech-controlled actions based on keywords and context thereof

Country Status (1)

Country Link
US (1) US10019992B2 (en)

Families Citing this family (108)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
CN104969289B (en) 2013-02-07 2021-05-28 苹果公司 Voice trigger of digital assistant
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
CN110442699A (en) 2013-06-09 2019-11-12 苹果公司 Operate method, computer-readable medium, electronic equipment and the system of digital assistants
JP6163266B2 (en) 2013-08-06 2017-07-12 アップル インコーポレイテッド Automatic activation of smart responses based on activation from remote devices
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
WO2015184186A1 (en) 2014-05-30 2015-12-03 Apple Inc. Multi-command single utterance input method
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US10152299B2 (en) 2015-03-06 2018-12-11 Apple Inc. Reducing response latency of intelligent automated assistants
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
EP3067884B1 (en) * 2015-03-13 2019-05-08 Samsung Electronics Co., Ltd. Speech recognition system and speech recognition method thereof
US10460227B2 (en) 2015-05-15 2019-10-29 Apple Inc. Virtual assistant in a communication session
US10200824B2 (en) 2015-05-27 2019-02-05 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US9578173B2 (en) 2015-06-05 2017-02-21 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US20160378747A1 (en) 2015-06-29 2016-12-29 Apple Inc. Virtual assistant for media playback
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10740384B2 (en) 2015-09-08 2020-08-11 Apple Inc. Intelligent automated assistant for media search and playback
US10331312B2 (en) 2015-09-08 2019-06-25 Apple Inc. Intelligent automated assistant in a media environment
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
JP2017117371A (en) * 2015-12-25 2017-06-29 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America Control method, control device, and program
US10388280B2 (en) * 2016-01-27 2019-08-20 Motorola Mobility Llc Method and apparatus for managing multiple voice operation trigger phrases
US10843080B2 (en) * 2016-02-24 2020-11-24 Virginia Tech Intellectual Properties, Inc. Automated program synthesis from natural language for domain specific computing applications
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US12223282B2 (en) 2016-06-09 2025-02-11 Apple Inc. Intelligent automated assistant in a home environment
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US12197817B2 (en) 2016-06-11 2025-01-14 Apple Inc. Intelligent device arbitration and control
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
US10885915B2 (en) * 2016-07-12 2021-01-05 Apple Inc. Intelligent software agent
US10311863B2 (en) 2016-09-02 2019-06-04 Disney Enterprises, Inc. Classifying segments of speech based on acoustic features and context
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
DK201770383A1 (en) 2017-05-09 2018-12-14 Apple Inc. User interface for correcting recognition errors
DK180048B1 (en) 2017-05-11 2020-02-04 Apple Inc. MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
DK201770428A1 (en) 2017-05-12 2019-02-18 Apple Inc. Low-latency intelligent automated assistant
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
DK201770411A1 (en) 2017-05-15 2018-12-20 Apple Inc. MULTI-MODAL INTERFACES
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US20180336892A1 (en) 2017-05-16 2018-11-22 Apple Inc. Detecting a trigger of a digital assistant
DK179560B1 (en) 2017-05-16 2019-02-18 Apple Inc. Far-field extension for digital assistant services
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US10891947B1 (en) * 2017-08-03 2021-01-12 Wells Fargo Bank, N.A. Adaptive conversation support bot
US10553235B2 (en) 2017-08-28 2020-02-04 Apple Inc. Transparent near-end user control over far-end speech enhancement processing
JP2019086903A (en) * 2017-11-02 2019-06-06 東芝映像ソリューション株式会社 Speech interaction terminal and speech interaction terminal control method
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
KR102715536B1 (en) * 2018-03-29 2024-10-11 삼성전자주식회사 Electronic device and control method thereof
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
DK201870355A1 (en) 2018-06-01 2019-12-16 Apple Inc. Virtual assistant operation in multi-device environments
DK179822B1 (en) 2018-06-01 2019-07-12 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
DK180639B1 (en) 2018-06-01 2021-11-04 Apple Inc DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT
US10944859B2 (en) 2018-06-03 2021-03-09 Apple Inc. Accelerated task performance
US10878202B2 (en) * 2018-08-03 2020-12-29 International Business Machines Corporation Natural language processing contextual translation
JP7001029B2 (en) * 2018-09-11 2022-01-19 日本電信電話株式会社 Keyword detector, keyword detection method, and program
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
CN110217242A (en) * 2019-04-25 2019-09-10 深圳航天科创智能科技有限公司 A kind of auto navigation audio recognition method and system
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
DK201970509A1 (en) 2019-05-06 2021-01-15 Apple Inc Spoken notifications
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
DK201970510A1 (en) 2019-05-31 2021-02-11 Apple Inc Voice identification in digital assistant systems
DK180129B1 (en) 2019-05-31 2020-06-02 Apple Inc. USER ACTIVITY SHORTCUT SUGGESTIONS
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11468890B2 (en) 2019-06-01 2022-10-11 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
WO2021056255A1 (en) 2019-09-25 2021-04-01 Apple Inc. Text detection using global geometry estimators
US12301635B2 (en) 2020-05-11 2025-05-13 Apple Inc. Digital assistant hardware abstraction
US11513667B2 (en) 2020-05-11 2022-11-29 Apple Inc. User interface for audio message
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11038934B1 (en) 2020-05-11 2021-06-15 Apple Inc. Digital assistant hardware abstraction
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7240006B1 (en) * 2000-09-27 2007-07-03 International Business Machines Corporation Explicitly registering markup based on verbal commands and exploiting audio context
US20110313768A1 (en) * 2010-06-18 2011-12-22 Christian Klein Compound gesture-speech commands
US8781825B2 (en) 2011-08-24 2014-07-15 Sensory, Incorporated Reducing false positives in speech recognition systems
US8340975B1 (en) * 2011-10-04 2012-12-25 Theodore Alfred Rosenberger Interactive speech recognition device and system for hands-free building control
US20130090930A1 (en) * 2011-10-10 2013-04-11 Matthew J. Monson Speech Recognition for Context Switching
US9020825B1 (en) * 2012-09-25 2015-04-28 Rawles Llc Voice gestures
US20140163976A1 (en) * 2012-12-10 2014-06-12 Samsung Electronics Co., Ltd. Method and user device for providing context awareness service using speech recognition

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12254717B2 (en) 2022-06-23 2025-03-18 Universal City Studios Llc Interactive imagery systems and methods

Also Published As

Publication number Publication date
US20160379633A1 (en) 2016-12-29

Similar Documents

Publication Publication Date Title
US10019992B2 (en) Speech-controlled actions based on keywords and context thereof
US10311863B2 (en) Classifying segments of speech based on acoustic features and context
JP6887031B2 (en) Methods, electronics, home appliances networks and storage media
JP4557919B2 (en) Audio processing apparatus, audio processing method, and audio processing program
US9293134B1 (en) Source-specific speech interactions
JP6447578B2 (en) Voice dialogue apparatus and voice dialogue method
US20190279642A1 (en) System and method for speech understanding via integrated audio and visual based speech recognition
US11978478B2 (en) Direction based end-pointing for speech recognition
US9711148B1 (en) Dual model speaker identification
US11508378B2 (en) Electronic device and method for controlling the same
US20220101856A1 (en) System and method for disambiguating a source of sound based on detected lip movement
JP6350903B2 (en) Operation assistance device and operation assistance method
JP6797338B2 (en) Information processing equipment, information processing methods and programs
US11830502B2 (en) Electronic device and method for controlling the same
KR101644015B1 (en) Communication interface apparatus and method for multi-user and system
KR20210042523A (en) An electronic apparatus and Method for controlling the electronic apparatus thereof
JP7215417B2 (en) Information processing device, information processing method, and program
KR20210042520A (en) An electronic apparatus and Method for controlling the electronic apparatus thereof
EP3839719B1 (en) Computing device and method of operating the same
CN111055291B (en) Guidance robot system and guidance method
CN110653812B (en) Interaction method of robot, robot and device with storage function
JP2007155986A (en) Voice recognition device and robot equipped with the same
KR100622019B1 (en) Voice interface system and method
JP6748565B2 (en) Voice dialogue system and voice dialogue method
Safie et al. Unmanned Aerial Vehicle (UAV) Control Through Speech Recognition

Legal Events

Date Code Title Description
AS Assignment

Owner name: DISNEY ENTERPRISES, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LEHMAN, JILL FAIN;AL MOUBAYED, SAMER;REEL/FRAME:035932/0572

Effective date: 20150629

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4