Patent application number | Description | Published |
--- | --- | --- |
20080201145 | Unsupervised labeling of sentence level accent - Methods are disclosed for automatic accent labeling without manually labeled data. The methods are designed to exploit accent distribution between function and content words. | 08-21-2008 |
20080208574 | Name synthesis - An automated method of providing a pronunciation of a word to a remote device is disclosed. The method includes receiving an input indicative of the word to be pronounced. The method further includes searching a database having a plurality of records. Each of the records has an indication of a textual representation and an associated indication of an audible representation. At least one output is provided to the remote device of an audible representation of the word to be pronounced. | 08-28-2008 |
20080240570 | SYMBOL GRAPH GENERATION IN HANDWRITTEN MATHEMATICAL EXPRESSION RECOGNITION - A forward pass through a sequence of strokes representing a handwritten equation is performed from the first stroke to the last stroke in the sequence. At each stroke, a path score is determined for a plurality of symbol-relation pairs that each represents a symbol and its spatial relation to a predecessor symbol. A symbol graph having nodes and links is constructed by backtracking through the strokes from the last stroke to the first stroke and assigning scores to the links based on the path scores for the symbol-relation pairs. The symbol graph is used to recognize a mathematical expression based in part on the scores for the links and the mathematical expression is stored. | 10-02-2008 |
20080243503 | MINIMUM DIVERGENCE BASED DISCRIMINATIVE TRAINING FOR PATTERN RECOGNITION - A method of providing discriminative training of a speech recognition unit is discussed. The method includes receiving an acoustic indication of an utterance having a hypothesis space and comparing the hypothesis space against a reference. The method measures the Kullback-Leibler Divergence (KLD) between the reference and the hypothesis space to adjust the reference and stores the adjusted reference on a tangible storage medium. | 10-02-2008 |
20090240501 | AUTOMATICALLY GENERATING NEW WORDS FOR LETTER-TO-SOUND CONVERSION - Described is a technology by which artificial words are generated based on seed words, and then used with a letter-to-sound conversion model. To generate an artificial word, a stressed syllable of a seed word is replaced with a different syllable, such as a candidate (artificial) syllable, when the phonemic structure and/or graphonemic structure of the stressed syllable and the candidate syllable match one another. In one aspect, the artificial words are provided for use with a letter-to-sound conversion model, which may be used to generate artificial phonemes from a source of words, such as in conjunction with other models. If the phonemes provided by the various models for a selected source word are in agreement relative to one another, the selected source word and an associated artificial phoneme may be added to a training set which may then be used to retrain the letter-to-sound conversion model. | 09-24-2009 |
20090245646 | Online Handwriting Expression Recognition - One way of recognizing online handwritten mathematical expressions is to use a one-pass, dynamic programming-based symbol decoding and graph generation algorithm. This method embeds segmentation into symbol identification to form a unified framework for symbol recognition. Along with decoding, a symbol graph is produced. Besides accurately recognizing handwritten mathematical expressions, this method can produce high quality symbol graphs. This method uses six knowledge source models to help search for possible symbol hypotheses during the decoding process. Here, knowledge source exponential weights and a symbol insertion penalty are used to weigh the various knowledge source model probabilities to increase accuracy. | 10-01-2009 |
20090324082 | CHARACTER AUTO-COMPLETION FOR ONLINE EAST ASIAN HANDWRITING INPUT - An exemplary method includes receiving stroke information for a partially written East Asian character, the East Asian character representable by one or more radicals; based on the stroke information, selecting a radical on a prefix tree wherein the prefix tree branches to East Asian characters as end states; identifying one or more East Asian characters as end states that correspond to the selected radical for the partially written East Asian character; and receiving user input to verify that one of the identified one or more East Asian characters is the end state for the partially written East Asian character. In such a method, the selection of a radical can occur using radical-based hidden Markov models. Various other exemplary methods, devices, systems, etc., are also disclosed. | 12-31-2009 |
20100066742 | STYLIZED PROSODY FOR SPEECH SYNTHESIS-BASED APPLICATIONS - Described is a technology by which the prosody of synthesized speech may be changed by varying data associated with that speech. An interface displays a visual representation of synthesized speech as one or more waveforms, along with the corresponding text from which the speech was synthesized. The user may interact with the visual representation to change data corresponding to the prosody, e.g., to change duration, pitch and/or loudness data, with respect to a part (or all) of the speech. The part of the speech that may be varied may comprise a phoneme, a morpheme, a syllable, a word, a phrase, and/or a sentence. The changed speech can be played back to hear the change in prosody resulting from the interactive changes. The user can also change the text and hear/see newly synthesized speech, which may then be similarly edited to change data that corresponds to that speech's prosody. | 03-18-2010 |
20100082345 | SPEECH AND TEXT DRIVEN HMM-BASED BODY ANIMATION SYNTHESIS - An “Animation Synthesizer” uses trainable probabilistic models, such as Hidden Markov Models (HMM), Artificial Neural Networks (ANN), etc., to provide speech and text driven body animation synthesis. Probabilistic models are trained using synchronized motion and speech inputs (e.g., live or recorded audio/video feeds) at various speech levels, such as sentences, phrases, words, phonemes, sub-phonemes, etc., depending upon the available data, and the motion type or body part being modeled. The Animation Synthesizer then uses the trainable probabilistic model for selecting animation trajectories for one or more different body parts (e.g., face, head, hands, arms, etc.) based on an arbitrary text and/or speech input. These animation trajectories are then used to synthesize a sequence of animations for digital avatars, cartoon characters, computer generated anthropomorphic persons or creatures, actual motions for physical robots, etc., that are synchronized with a speech output corresponding to the text and/or speech input. | 04-01-2010 |
20100166314 | Segment Sequence-Based Handwritten Expression Recognition - Methods and apparatuses for generating, by a computing device configured to interpret a handwritten expression, a symbol graph to represent strokes associated with the handwritten expression, are described herein. The symbol graph may include nodes, each node corresponding to a combination of a stroke and a candidate symbol for that stroke. The computing device may also generate a segment graph based on the symbol graph by combining nodes associated with a same stroke if strokes of their preceding nodes are the same. Also the computing device may perform a structure analysis on at least a subset of segment sequences represented by the segment graph to determine hypotheses for the handwritten expression. In other embodiments, rather than generate a segment graph, the computing device may determine segment sequences by selecting a number of symbol sequences from the symbol graph and combining symbol sequences having the same segmentation. | 07-01-2010 |
20100198577 | STATE MAPPING FOR CROSS-LANGUAGE SPEAKER ADAPTATION - Creation of sub-phonemic Hidden Markov Model (HMM) states and the mapping of those states results in improved cross-language speaker adaptation. The smaller sub-phonemic mapping provides improvements in usability and intelligibility, particularly between languages with few common phonemes. HMM states of different languages may be mapped to one another using a distance between the HMM states in acoustic space. This distance may be calculated using Kullback-Leibler divergence and multi-space probability distribution. By combining distance mapping and context mapping for different speakers of the same language, improved cross-language speaker adaptation is possible. | 08-05-2010 |
20110054903 | RICH CONTEXT MODELING FOR TEXT-TO-SPEECH ENGINES - Embodiments of rich context modeling for speech synthesis are disclosed. In operation, a text-to-speech engine refines a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models. The text-to-speech engine then generates synthesized speech for an input text based at least on some of the plurality of refined rich context models. | 03-03-2011 |
20110071835 | SMALL FOOTPRINT TEXT-TO-SPEECH ENGINE - Embodiments of a small footprint text-to-speech engine are disclosed. In operation, the small footprint text-to-speech engine generates a set of feature parameters for an input text. The set of feature parameters includes static feature parameters and delta feature parameters. The small footprint text-to-speech engine then derives a saw-tooth stochastic trajectory that represents the speech characteristics of the input text based on the static feature parameters and the delta feature parameters. Finally, the small footprint text-to-speech engine produces a smoothed trajectory from the saw-tooth stochastic trajectory, and generates synthesized speech based on the smoothed trajectory. | 03-24-2011 |
20110184723 | PHONETIC SUGGESTION ENGINE - A phonetic suggestion engine provides word or phrase suggestions for an input letter string. The engine initially converts the input letter string into one or more query phoneme sequences. The conversion is performed via at least one standardized letter-to-sound (LTS) database. The phonetic suggestion engine further obtains, from a pool of potential phoneme sequences, a plurality of candidate phoneme sequences that are phonetically similar to the query phoneme sequences. The phonetic suggestion engine then prunes the plurality of candidate phoneme sequences to generate scored phoneme sequences. The phonetic suggestion engine subsequently generates a plurality of ranked word or phrase suggestions based on the scored phoneme sequences. | 07-28-2011 |
20120116761 | Minimum Converted Trajectory Error (MCTE) Audio-to-Video Engine - Embodiments of an audio-to-video engine are disclosed. In operation, the audio-to-video engine generates facial movement (e.g., a virtual talking head) based on an input speech. The audio-to-video engine receives the input speech and recognizes the input speech as a source feature vector. The audio-to-video engine then determines a Maximum A Posterior (MAP) mixture sequence based on the source feature vector. The MAP mixture sequence may be a function of a refined Gaussian Mixture Model (GMM). The audio-to-video engine may then use the MAP mixture sequence to estimate video feature parameters. The video feature parameters are then interpreted as facial movement. The facial movement may be stored as data to a storage module and/or it may be displayed as video to a display device. | 05-10-2012 |
20120130717 | Real-time Animation for an Expressive Avatar - Techniques for providing real-time animation for a personalized cartoon avatar are described. In one example, a process trains one or more animated models to provide a set of probabilistic motions of one or more upper body parts based on speech and motion data. The process links one or more predetermined phrases that represent emotional states to the one or more animated models. After creation of the models, the process receives real-time speech input. Next, the process identifies an emotional state to be expressed based on the one or more predetermined phrases matching in context to the real-time speech input. The process then generates an animated sequence of motions of the one or more upper body parts by applying the one or more animated models in response to the real-time speech input. | 05-24-2012 |
20120143611 | Trajectory Tiling Approach for Text-to-Speech - Hidden Markov Model (HMM) trajectory tiling (HTT)-based approaches may be used to synthesize speech from text. In operation, a set of HMMs and a set of waveform units may be obtained from a speech corpus. The set of HMMs is further refined via minimum generation error (MGE) training to generate a refined set of HMMs. Subsequently, a speech parameter trajectory may be generated by applying the refined set of HMMs to an input text. A unit lattice of candidate waveform units may be selected from the set of waveform units based at least on the speech parameter trajectory. A normalized cross-correlation (NCC)-based search on the unit lattice may be performed to obtain a minimal concatenation cost sequence of candidate waveform units, which are concatenated into a concatenated waveform sequence that is synthesized into speech. | 06-07-2012 |
20120191456 | POSITION-DEPENDENT PHONETIC MODELS FOR RELIABLE PRONUNCIATION IDENTIFICATION - A representation of a speech signal is received and is decoded to identify a sequence of position-dependent phonetic tokens wherein each token comprises a phone and a position indicator that indicates the position of the phone within a syllable. | 07-26-2012 |
20120253781 | FRAME MAPPING APPROACH FOR CROSS-LINGUAL VOICE TRANSFORMATION - Frame mapping-based cross-lingual voice transformation may transform a target speech corpus in a particular language into a transformed target speech corpus that remains recognizable, and has the voice characteristics of a target speaker that provided the target speech corpus. A formant-based frequency warping is performed on the fundamental frequencies and the linear predictive coding (LPC) spectrums of source speech waveforms in a first language to produce transformed fundamental frequencies and transformed LPC spectrums. The transformed fundamental frequencies and the transformed LPC spectrums are then used to generate warped parameter trajectories. The warped parameter trajectories are further used to transform the target speech waveforms in the second language to produce transformed target speech waveforms with voice characteristics of the first language that nevertheless retain at least some voice characteristics of the target speaker. | 10-04-2012 |
20120276504 | Talking Teacher Visualization for Language Learning - A representation of a virtual language teacher assists in language learning. The virtual language teacher may appear as a “talking head” in a video that a student views to practice pronunciation of a foreign language. A system for generating a virtual language teacher receives input text. The system may generate a video showing the virtual language teacher as a talking head having a mouth that moves in synchronization with speech generated from the input text. The video of the virtual language teacher may then be presented to the student. | 11-01-2012 |
20130218566 | AUDIO HUMAN INTERACTIVE PROOF BASED ON TEXT-TO-SPEECH AND SEMANTICS - The text-to-speech audio HIP technique described herein in some embodiments uses different correlated or uncorrelated words or sentences generated via a text-to-speech engine as audio HIP challenges. The technique can apply different effects in the text-to-speech synthesizer speaking a sentence to be used as a HIP challenge string. The different effects can include, for example, spectral frequency warping; vowel duration warping; background addition; echo addition; and varying the time duration between words, among others. In some embodiments the technique varies the set of parameters to prevent Automated Speech Recognition tools from using previously used audio HIP challenges to learn a model which can then be used to recognize future audio HIP challenges generated by the technique. Additionally, in some embodiments the technique introduces the requirement of semantic understanding in HIP challenges. | 08-22-2013 |
20140025381 | EVALUATING TEXT-TO-SPEECH INTELLIGIBILITY USING TEMPLATE CONSTRAINED GENERALIZED POSTERIOR PROBABILITY - Instead of relying on humans to subjectively evaluate speech intelligibility of a subject, a system objectively evaluates the speech intelligibility. The system receives speech input and calculates confidence scores at multiple different levels using a Template Constrained Generalized Posterior Probability algorithm. One or multiple intelligibility classifiers are utilized to classify the desired entities on an intelligibility scale. A specific intelligibility classifier utilizes features such as the various confidence scores. The scale of the intelligibility classification can be adjusted to suit the application scenario. Based on the confidence score distributions and the intelligibility classification results at multiple levels, an overall objective intelligibility score is calculated. The objective intelligibility scores can be used to rank different subjects or systems being assessed according to their intelligibility levels. The speech that is below a predetermined intelligibility (e.g. utterances with low confidence scores and most severe intelligibility issues) can be automatically selected for further analysis. | 01-23-2014 |
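Two of the abstracts above (20080243503 and 20100198577) measure the distance between acoustic models with the Kullback-Leibler divergence (KLD). As a minimal, illustrative sketch only, not the patented methods (which operate on HMM states and multi-space probability distributions), the closed-form KLD between two univariate Gaussians can be computed as follows; the function name is my own:

```python
import math

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """Closed-form KL divergence KL(p || q) between two univariate
    Gaussians p = N(mu_p, var_p) and q = N(mu_q, var_q)."""
    return (math.log(math.sqrt(var_q / var_p))
            + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
            - 0.5)

# The divergence is zero only when the two distributions coincide,
# and it is asymmetric in general: KL(p||q) != KL(q||p).
print(kl_gaussian(0.0, 1.0, 0.0, 1.0))  # -> 0.0
```

Because KLD is asymmetric, a mapping scheme such as the one in 20100198577 must choose a direction (or symmetrize) when comparing states across languages.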
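Abstract 20090324082 describes selecting radicals on a prefix tree that branches to East Asian characters as end states. A toy sketch of that lookup structure, assuming exact radical matching in place of the patent's radical-based hidden Markov models over stroke data (the class names and sample decompositions here are illustrative, not from the patent):

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # radical -> TrieNode
        self.characters = []  # characters reachable as end states

class RadicalTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, radicals, character):
        """Index a character under its sequence of component radicals."""
        node = self.root
        for r in radicals:
            node = node.children.setdefault(r, TrieNode())
            node.characters.append(character)  # reachable from this prefix

    def complete(self, radicals):
        """Return candidate end-state characters for a partial prefix."""
        node = self.root
        for r in radicals:
            if r not in node.children:
                return []
            node = node.children[r]
        return node.characters

trie = RadicalTrie()
trie.insert(["日", "月"], "明")
trie.insert(["日", "寺"], "時")
print(trie.complete(["日"]))  # both characters share the 日 radical
```

In the patented method the branch taken at each step is not an exact match but the radical hypothesis scored best by a radical-based HMM, with the user verifying the final candidate.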