
Image to speech

Subclass of:

704 - Data processing: speech signal processing, linguistics, language translation, and audio compression/decompression

704200000 - SPEECH SIGNAL PROCESSING

704258000 - Synthesis

Patent class list (only non-empty classes are listed)

Deeper subclasses:

Entries
Document | Title | Date
20130030810FRUGAL METHOD AND SYSTEM FOR CREATING SPEECH CORPUS - The present invention provides a frugal method for extracting speech data and associated transcriptions from a plurality of web resources (the internet) for speech corpus creation, characterized by automation of the speech corpus creation and by cost reduction. An existing speech corpus is integrated with the speech data and transcriptions extracted from the web resources to build an aggregated, rich speech corpus that is effective and easy to adapt for generating acoustic and language models for Automatic Speech Recognition (ASR) systems.01-31-2013
20130211838APPARATUS AND METHOD FOR EMOTIONAL VOICE SYNTHESIS - The present disclosure provides an emotional voice synthesis apparatus and an emotional voice synthesis method. The emotional voice synthesis apparatus includes a word dictionary storage unit for storing emotional words in an emotional word dictionary after classifying the emotional words into items each containing at least one of an emotion class, similarity, positive or negative valence, and sentiment strength; voice DB storage unit for storing voices in a database after classifying the voices according to at least one of emotion class, similarity, positive or negative valence and sentiment strength in correspondence to the emotional words; emotion reasoning unit for inferring an emotion matched with the emotional word dictionary with respect to at least one of each word, phrase, and sentence of document including text and e-book; and voice output unit for selecting and outputting a voice corresponding to the document from the database according to the inferred emotion.08-15-2013
20130211837SYSTEM AND METHOD FOR MAKING AN ELECTRONIC HANDHELD DEVICE MORE ACCESSIBLE TO A DISABLED PERSON - An electronic handheld device is described having an options module for providing a user with at least one option in the handheld device, each option associated with an enabling mode of operation of the handheld device. The device also includes an enabling module for implementing, in response to a particular option being selected by a user, an associated enabling mode of operation. Each enabling mode of operation makes the handheld device more accessible to a person having a corresponding disability.08-15-2013
20100153114AUDIO OUTPUT OF A DOCUMENT FROM MOBILE DEVICE - Architecture for playing a document converted into an audio format to a user of an audio-output capable device. The user can interact with the device to control play of the audio document, such as pause, rewind, forward, etc. In a more robust implementation, the audio-output capable device is a mobile device (e.g., cell phone) having a microphone for processing voice input. Voice commands can then be input to control play (“reading”) of the document audio file to pause, rewind, read paragraph, read next chapter, fast forward, etc. A communications server (e.g., email, attachments to email, etc.) transcodes text-based document content into an audio format by leveraging a text-to-speech (TTS) engine. The transcoded audio files are then transferred to mobile devices through viable transmission channels. Users can then play the audio-formatted document while freeing hand and eye usage for other tasks.06-17-2010
20090055187Conversion of text email or SMS message to speech spoken by animated avatar for hands-free reception of email and SMS messages while driving a vehicle - Subscribers can access and listen to their email hands-free while they drive. In further accord with the present invention, a selectable avatar speaks the email message. The invention also provides unified messaging such that SMS and email are unified, presented, and spoken by the avatar, so the subscriber need not access two devices (an instant message device and an email device). Additionally, the invention can convert natural language in a message to an acronym to be spoken by the avatar, and can convert acronyms in a message to natural language spoken by the avatar; the subscriber selects the desired one of these two.02-26-2009
20110196679Systems And Methods For Machine To Operator Communications - Systems and methods for machine to operator communications are disclosed. For example, one disclosed system includes a concentrator having a memory; a radio transmitter; and a processor in communication with the memory and the radio transmitter, the processor configured to: request information associated with a status of a machine; receive information associated with the request; determine a message based on the received information; generate an audio signal based on the message; and transmit the audio signal to the radio transmitter.08-11-2011
20110202347COMMUNICATION CONVERTER FOR CONVERTING AUDIO INFORMATION/TEXTUAL INFORMATION TO CORRESPONDING TEXTUAL INFORMATION/AUDIO INFORMATION - A communication converter is described for converting among speech signals and textual information, permitting communication between telephone users and textual instant communications users.08-18-2011
20110202346METHOD AND APPARATUS FOR GENERATING SYNTHETIC SPEECH WITH CONTRASTIVE STRESS - Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings.08-18-2011
20110202345METHOD AND APPARATUS FOR GENERATING SYNTHETIC SPEECH WITH CONTRASTIVE STRESS - Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings.08-18-2011
20110202344METHOD AND APPARATUS FOR PROVIDING SPEECH OUTPUT FOR SPEECH-ENABLED APPLICATIONS - Techniques for providing speech output for speech-enabled applications. A synthesis system receives from a speech-enabled application a text input including a text transcription of a desired speech output. The synthesis system selects one or more audio recordings corresponding to one or more portions of the text input. In one aspect, the synthesis system selects from audio recordings provided by a developer of the speech-enabled application. In another aspect, the synthesis system selects an audio recording of a speaker speaking a plurality of words. The synthesis system forms a speech output including the one or more selected audio recordings and provides the speech output for the speech-enabled application.08-18-2011
20090094034VOICE INFORMATION RECORDING APPARATUS - A link table is generated, voice information is associated by dot patterns, and then, voice information associated with the dot pattern is reproduced from a speaker when the dot pattern is read by means of a scanner. In this manner, the dot pattern is printed on a surface of a material such as a picture book or a card, making it possible to play back voice information corresponding to a pattern or a story of a picture book and to play back voice information corresponding to a character described on the card. In addition, by means of a link table, new voice information can be associated with, dissociated from, or changed to, a new dot pattern.04-09-2009
20100076766Method for producing indicators and processing apparatus and system utilizing the indicators - The present invention discloses a method for producing graphical indicators and interactive systems utilizing the graphical indicators. On the surface of an object, visually negligible graphical indicators are provided. The graphical indicators and the main information, i.e. text or pictures, co-exist on the surface of the object. The graphical indicators do not interfere with the main information as far as the perception of the human eye is concerned. With the graphical indicators, further information beyond the main information is carried on the surface of the object. In addition to the main information on the surface of the object, one is able to obtain additional information through an auxiliary electronic device or trigger an interactive operation.03-25-2010
20100076767TEXT TO SPEECH CONVERSION OF TEXT MESSAGES FROM MOBILE COMMUNICATION DEVICES - A method includes providing a user interface, at a mobile communication device, that includes a first area to receive text input and a second area to receive an identifier associated with an addressee device. The text input and the identifier are received via the user interface. A short message service (SMS) message including the text input is transmitted to a Text to Speech (TTS) server for conversion into an audio message and for transmission of the audio message to the addressee device associated with the identifier. An acknowledge message transmitted from the TTS server permits the addressee device to allow delivery of the audio message or to decline delivery of the audio message. The TTS server transmits the audio message in response to the addressee device allowing delivery of the audio message. A confirmation message is received from the TTS server that indicates that a reply voice message has been received from the addressee device in response to the audio message.03-25-2010
20100114578Method and Apparatus for Improving Voice recognition performance in a voice application distribution system - A vocabulary management system for constraining voice recognition processing associated with text-to-speech and speech-to-text rendering associated with use of a voice application in progress between a user accessing a data source through a voice portal has a vocabulary management server connected to a voice application server and to a telephony server, and an instance of vocabulary management software running on the management server for enabling vocabulary establishment and management for voice recognition software. The system is characterized in that an administrator accessing the vocabulary management server uses the software to create unique vocabulary sets that are specific to selected portions of vocabulary associated with target data sources, the vocabulary sets differing in content according to administrator direction.05-06-2010
20130080172OBJECTIVE EVALUATION OF SYNTHESIZED SPEECH ATTRIBUTES - A method of evaluating attributes of synthesized speech. The method includes processing a text input into a synthesized speech utterance using a processor of a text-to-speech system, applying a human speech utterance to a speech model to obtain a reference wherein the human speech utterance corresponds to the text input, applying the synthesized speech utterance to at least one of the speech model or another speech model to obtain a test, and calculating a difference between the test and the reference. The method can also be used in a speech synthesis method.03-28-2013
20130080176Methods and Apparatus for Rapid Acoustic Unit Selection From a Large Speech Corpus - A speech synthesis system can select recorded speech fragments, or acoustic units, from a very large database of acoustic units to produce artificial speech. The selected acoustic units are chosen to minimize a combination of target and concatenation costs for a given sentence. However, as concatenation costs, which are measures of the mismatch between sequential pairs of acoustic units, are expensive to compute, processing can be greatly reduced by pre-computing and caching the concatenation costs. The number of possible sequential pairs of acoustic units makes such caching prohibitive. Statistical experiments reveal that while about 85% of the acoustic units are typically used in common speech, less than 1% of the possible sequential pairs of acoustic units occur in practice. The system synthesizes a large body of speech, identifies the acoustic unit sequential pairs generated and their respective concatenation costs, and stores those concatenation costs likely to occur.03-28-2013
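The pre-computation strategy this abstract outlines — cache concatenation costs only for the small fraction of sequential unit pairs that actually occur in synthesized speech — can be sketched as follows. This is a minimal illustration with a toy cost function; none of the names or numbers come from the patent.

```python
# Minimal sketch: pre-compute and cache concatenation costs for the unit
# pairs observed in a large body of synthesized speech, so runtime unit
# selection can look them up cheaply. The cost function is a stand-in.

def concatenation_cost(unit_a, unit_b):
    # Stand-in for an expensive spectral-mismatch measure between two units.
    return abs(hash(unit_a) % 100 - hash(unit_b) % 100) / 100.0

def build_cost_cache(synthesized_unit_sequences):
    """Record which sequential pairs occur and store their costs."""
    cache = {}
    for sequence in synthesized_unit_sequences:
        for a, b in zip(sequence, sequence[1:]):
            if (a, b) not in cache:
                cache[(a, b)] = concatenation_cost(a, b)
    return cache

def cached_cost(cache, a, b):
    # Fall back to on-the-fly computation for the rare uncached pair.
    return cache.get((a, b), concatenation_cost(a, b))

corpus = [["h", "e", "l", "o"], ["h", "e", "l", "p"]]
cache = build_cost_cache(corpus)
```

Because under 1% of possible pairs occur in practice (per the abstract's statistics), the cache stays small even for a very large unit inventory.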
20130080175MARKUP ASSISTANCE APPARATUS, METHOD AND PROGRAM - According to one embodiment, a markup assistance apparatus includes an acquisition unit, a first calculation unit, a detection unit and a presentation unit. The acquisition unit acquires feature amounts for respective tags, each of the tags being used to control text-to-speech processing of a markup text. The first calculation unit calculates, for respective character strings, a variance of the feature amounts of the tags assigned to the character string in a markup text. The detection unit detects a first character string, assigned a first tag whose variance is not less than a first threshold value, as a first candidate including a tag to be corrected. The presentation unit presents the first candidate.03-28-2013
20130080174RETRIEVING DEVICE, RETRIEVING METHOD, AND COMPUTER PROGRAM PRODUCT - In an embodiment, a retrieving device includes: a text input unit, a first extracting unit, a retrieving unit, a second extracting unit, an acquiring unit, and a selecting unit. The text input unit inputs a text including unknown word information representing a phrase that a user was unable to transcribe. The first extracting unit extracts related words representing a phrase related to the unknown word information among phrases other than the unknown word information included in the text. The retrieving unit retrieves a related document representing a document including the related words. The second extracting unit extracts candidate words representing candidates for the unknown word information from a plurality of phrases included in the related document. The acquiring unit acquires reading information representing estimated pronunciation of the unknown word information. The selecting unit selects at least one candidate word whose pronunciation is similar to the reading information.03-28-2013
20130080173CORRECTING UNINTELLIGIBLE SYNTHESIZED SPEECH - A method and system of speech synthesis. A text input is received in a text-to-speech system and, using a processor of the system, the text input is processed into synthesized speech which is established as unintelligible. The text input is reprocessed into subsequent synthesized speech and output to a user via a loudspeaker to correct the unintelligible synthesized speech. In one embodiment, the synthesized speech can be established as unintelligible by predicting intelligibility of the synthesized speech, and determining that the predicted intelligibility is lower than a minimum threshold. In another embodiment, the synthesized speech can be established as unintelligible by outputting the synthesized speech to the user via the loudspeaker, and receiving an indication from the user that the synthesized speech is not intelligible.03-28-2013
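The first embodiment above — predict intelligibility, and re-synthesize when the prediction falls below a minimum threshold — can be sketched as a simple retry loop. The synthesizer, predictor, and threshold below are illustrative stand-ins, not the patented components.

```python
# Minimal sketch of the "predict, then reprocess" embodiment. A real retry
# might slow the speaking rate, switch voices, or rephrase the text.

MIN_INTELLIGIBILITY = 0.7  # assumed minimum threshold

def synthesize(text, attempt=0):
    # Stand-in TTS engine; the attempt number stands in for changed settings.
    return {"text": text, "attempt": attempt}

def predict_intelligibility(speech):
    # Stand-in predictor; here each retry is assumed to sound clearer.
    return min(1.0, 0.5 + 0.3 * speech["attempt"])

def speak(text, max_attempts=3):
    """Reprocess the text until predicted intelligibility clears the bar."""
    for attempt in range(max_attempts):
        speech = synthesize(text, attempt)
        if predict_intelligibility(speech) >= MIN_INTELLIGIBILITY:
            return speech
    return speech  # give up: return the last attempt

result = speak("hello world")
```

The second embodiment replaces the predictor with explicit listener feedback, but the control flow is the same loop.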
20130085759SPEECH SAMPLES LIBRARY FOR TEXT-TO-SPEECH AND METHODS AND APPARATUS FOR GENERATING AND USING SAME - A method for converting text into speech with a speech sample library is provided. The method comprises converting an input text to a sequence of triphones; determining musical parameters of each phoneme in the sequence of triphones; detecting, in the speech sample library, speech segments having at least the determined musical parameters; and concatenating the detected speech segments.04-04-2013
20130085758Telecare and/or telehealth communication method and system - A telecare and/or telehealth communication method is described. The method comprises providing predetermined voice messages configured to ask questions to or to give instructions to an assisted individual, providing an algorithm configured to communicate with the assisted individual, and communicating at least one of the predetermined voice messages configured to ask questions to or to give instructions to the assisted individual. The method further comprises analyzing responsiveness and/or compliance characteristics of the assisted individual, and providing the assisted individual with voice messages in a form most acceptable and effective for the assisted individual on the basis of the analyzed responsiveness and/or the analyzed compliance characteristics.04-04-2013
20130085760TRAINING AND APPLYING PROSODY MODELS - Techniques for training and applying prosody models for speech synthesis are provided. A speech recognition engine processes audible speech to produce text annotated with prosody information. A prosody model is trained with this annotated text. After initial training, the model is applied during speech synthesis to generate speech with non-standard prosody from input text. Multiple prosody models can be used to represent different prosody styles.04-04-2013
20090055186METHOD TO VOICE ID TAG CONTENT TO EASE READING FOR VISUALLY IMPAIRED - A method for providing information to generate distinguishing voices for text content attributable to different authors includes receiving a plurality of text sections each attributable to one of a plurality of authors; identifying which author authored each text section; assigning a unique voice tag id to each author; associating a distinct set of descriptive metadata with each unique voice tag id; and generating a set of speech information for each text section. The set of speech information generated for each text section is based upon the distinct set of descriptive metadata associated with the unique voice tag id assigned to the corresponding author of the text section. The set of speech information generated for each text section is configured to be used by a speech synthesizer to translate the text section into speech in a distinguishing computer-generated voice for the author of the text section.02-26-2009
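The tagging scheme this abstract describes — a unique voice tag id per author, with descriptive metadata reused across all of that author's text sections — might look like the following sketch. The metadata fields and style table are illustrative assumptions, not from the patent.

```python
# Minimal sketch: assign each author a unique voice tag id plus metadata so
# a synthesizer can render each author's text in a distinguishing voice.

import itertools

_voice_ids = itertools.count(1)
VOICE_STYLES = [  # assumed descriptive metadata sets
    {"pitch": "low", "rate": "slow"},
    {"pitch": "high", "rate": "fast"},
    {"pitch": "mid", "rate": "normal"},
]

def assign_voice_tags(sections):
    """sections: list of (author, text). Returns per-section speech info."""
    tags = {}  # author -> (voice_tag_id, metadata)
    speech_info = []
    for author, text in sections:
        if author not in tags:
            voice_id = next(_voice_ids)
            tags[author] = (voice_id, VOICE_STYLES[(voice_id - 1) % len(VOICE_STYLES)])
        voice_id, metadata = tags[author]
        speech_info.append({"text": text, "voice_tag_id": voice_id,
                            "metadata": metadata})
    return speech_info

info = assign_voice_tags([("alice", "Hi."), ("bob", "Hello."), ("alice", "Bye.")])
```

A downstream speech synthesizer would then read each section's metadata to pick the corresponding computer-generated voice.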
20130041669SPEECH OUTPUT WITH CONFIDENCE INDICATION - A method, system, and computer program product are provided for speech output with confidence indication. The method includes receiving a confidence score for segments of speech or text to be synthesized to speech. The method further includes modifying a speech segment by altering one or more parameters of the speech in proportion to the confidence score.02-14-2013
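Altering synthesis parameters in proportion to a per-segment confidence score, as this abstract describes, can be sketched as a simple linear scaling. The particular parameters and scaling ranges below are illustrative assumptions.

```python
# Minimal sketch: scale volume and rate with confidence so low-confidence
# segments audibly signal uncertainty to the listener.

BASE_VOLUME = 1.0
BASE_RATE = 1.0

def apply_confidence(segments):
    """segments: list of (text, confidence in [0, 1])."""
    out = []
    for text, confidence in segments:
        out.append({
            "text": text,
            # Parameters vary linearly with the confidence score.
            "volume": BASE_VOLUME * (0.5 + 0.5 * confidence),
            "rate": BASE_RATE * (0.8 + 0.2 * confidence),
        })
    return out

plan = apply_confidence([("hello", 1.0), ("wrld", 0.2)])
```

Any synthesis parameter the engine exposes (pitch, emphasis, voice quality) could be scaled the same way.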
20130041668VOICE LEARNING APPARATUS, VOICE LEARNING METHOD, AND STORAGE MEDIUM STORING VOICE LEARNING PROGRAM - A voice learning apparatus includes a learning-material voice storage unit that stores learning material voice data including example sentence voice data; a learning text storage unit that stores a learning material text including an example sentence text; a learning-material text display controller that displays the learning material text; a learning-material voice output controller that performs voice output based on the learning material voice data; an example sentence specifying unit that specifies the example sentence text during the voice output; an example-sentence voice output controller that performs voice output based on the example sentence voice data associated with the specified example sentence text; and a learning-material voice output restart unit that restarts the voice output from a position where the voice output is stopped last time, after the voice output is performed based on the example sentence voice data.02-14-2013
20100042410Training And Applying Prosody Models - Techniques for training and applying prosody models for speech synthesis are provided. A speech recognition engine processes audible speech to produce text annotated with prosody information. A prosody model is trained with this annotated text. After initial training, the model is applied during speech synthesis to generate speech with non-standard prosody from input text. Multiple prosody models can be used to represent different prosody styles.02-18-2010
20100324905VOICE MODELS FOR DOCUMENT NARRATION - Disclosed are techniques and systems to provide a narration of a text in multiple different voices. Further disclosed are techniques and systems for modifying a voice model associated with a selected character based on data received from a user.12-23-2010
20080294442APPARATUS, METHOD AND SYSTEM - A method includes obtaining digital content comprising text content; obtaining at least one speech parameter associated with the digital content; and using the speech parameters as an input, generating a speech output corresponding to at least part of the text content. Corresponding apparatuses, system and computer program products are also presented.11-27-2008
20090157409METHOD AND APPARATUS FOR TRAINING DIFFERENCE PROSODY ADAPTATION MODEL, METHOD AND APPARATUS FOR GENERATING DIFFERENCE PROSODY ADAPTATION MODEL, METHOD AND APPARATUS FOR PROSODY PREDICTION, METHOD AND APPARATUS FOR SPEECH SYNTHESIS - A method includes, generating, for each parameter of the prosody vector, an initial parameter prediction model with a plurality of attributes related to difference prosody prediction and at least part of attribute combinations of the plurality of attributes, in which each of the plurality of attributes and the attribute combinations is included as an item, calculating importance of each item in the parameter prediction model, deleting the item having the lowest importance calculated, re-generating a parameter prediction model with the remaining items, determining whether the re-generated parameter prediction model is an optimal model, and repeating the step of calculating importance and the steps following the step of calculating importance with the re-generated parameter prediction model, if the re-generated parameter prediction model is determined as not an optimal model, wherein the difference prosody vector and all parameter prediction models of the difference prosody vector constitute the difference prosody adaptation model.06-18-2009
20090157407Methods, Apparatuses, and Computer Program Products for Semantic Media Conversion From Source Files to Audio/Video Files - An apparatus for semantic media conversion from source data to audio/video data may include a processor. The processor may be configured to parse source data having text and one or more tags and create a semantic structure model representative of the source data, and generate audio data comprising at least one of speech converted from parsed text of the source data contained in the semantic structure model and applied audio effects. Corresponding methods and computer program products are also provided.06-18-2009
20100106506SYSTEMS AND METHODS FOR DOCUMENT NAVIGATION WITH A TEXT-TO-SPEECH ENGINE - A system for visually navigating a document in conjunction with a text-to-speech (“TTS”) engine presents a visual display of a region of interest that is related to the text of the document that is being audibly presented as speech to a user. When the TTS engine converts the text to speech and presents the speech to the user, the system presents the corresponding section of text on a display. During the presentation, if the system encounters a linked section of text, the visual display changes to display a linked region of interest that corresponds to the linked section of text.04-29-2010
20120185253EXTRACTING TEXT FOR CONVERSION TO AUDIO - Embodiments are disclosed that relate to converting markup content to an audio output. For example, one disclosed embodiment provides, in a computing device a method including partitioning a markup document into a plurality of content panels, and forming a subset of content panels by filtering the plurality of content panels based upon geometric and/or location-based criteria of each panel relative to an overall organization of the markup document. The method further includes determining a document object model (DOM) analysis value for each content panel of the subset of content panels, identifying a set of content panels determined to contain text body content by filtering the subset of content panels based upon the DOM analysis value of each of the content panels of the subset of content panels, and converting text in a selected content panel determined to contain text body content to an audio output.07-19-2012
20090125309Methods, Systems, and Products for Synthesizing Speech - Methods, Systems, and Products are disclosed for synthesizing speech. Text is received for translation to speech. The text is correlated to phrases, and each phrase is converted into a corresponding string of phonemes. A phoneme identifier is retrieved that uniquely represents each phoneme in the string of phonemes. Each phoneme identifier is concatenated to produce a sequence of phoneme identifiers with each phoneme identifier separated by a comma. Each sequence of phoneme identifiers is concatenated and separated by a semi-colon.05-14-2009
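The encoding this abstract specifies — phoneme identifiers joined by commas within a phrase, phrase sequences joined by semicolons — is concrete enough to sketch directly. The phoneme lexicon and identifier table below are illustrative assumptions.

```python
# Minimal sketch of the comma/semicolon phoneme-identifier encoding.

PHONEME_IDS = {"HH": 10, "AH": 11, "L": 12, "OW": 13,
               "W": 14, "ER": 15, "D": 16}  # assumed identifier table

LEXICON = {  # assumed word-to-phoneme lexicon
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def encode_phrase(phrase):
    """Convert one phrase to a comma-separated string of phoneme ids."""
    phonemes = [p for word in phrase.split() for p in LEXICON[word]]
    return ",".join(str(PHONEME_IDS[p]) for p in phonemes)

def encode_text(phrases):
    """Join per-phrase identifier sequences with semicolons."""
    return ";".join(encode_phrase(phrase) for phrase in phrases)

encoded = encode_text(["hello", "world"])
# encoded == "10,11,12,13;14,15,12,16"
```

A synthesizer on the receiving end would split on semicolons and commas to recover the phrase and phoneme boundaries.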
20130046541APPARATUS FOR ASSISTING VISUALLY IMPAIRED PERSONS TO IDENTIFY PERSONS AND OBJECTS AND METHOD FOR OPERATION THEREOF - An apparatus for assisting visually impaired persons includes a headset. A camera is mounted on the headset. A microprocessor communicates with the camera for receiving an optically read code captured by the camera and converting the optically read code to an audio signal as a function of a trigger contained within the optical code. A speaker communicating with the processor outputs the audio signal.02-21-2013
20090254346AUTOMATED VOICE ENABLEMENT OF A WEB PAGE - Embodiments of the present invention provide a method, system and computer program product for the automated voice enablement of a Web page. In an embodiment of the invention, a method for voice enabling a Web page can include selecting an input field of a Web page for speech input, generating a speech grammar for the input field based upon terms in a core attribute of the input field, receiving speech input for the input field, posting the received speech input and the grammar to an automatic speech recognition (ASR) engine and inserting a textual equivalent to the speech input provided by the ASR engine into a document object model (DOM) for the Web page.10-08-2009
20090043583DYNAMIC MODIFICATION OF VOICE SELECTION BASED ON USER SPECIFIC FACTORS - The present invention discloses a solution for customizing synthetic voice characteristics in a user specific fashion. The solution can establish a communication between a user and a voice response system. A data store can be searched for a speech profile associated with the user. When a speech profile is found, a set of speech output characteristics established for the user from the profile can be determined. Parameters and settings of a text-to-speech engine can be adjusted in accordance with the determined set of speech output characteristics. During the established communication, synthetic speech can be generated using the adjusted text-to-speech engine. Thus, each detected user can hear a synthetic speech generated by a different voice specifically selected for that user. When no user profile is detected, a default voice or a voice based upon a user's speech or communication details can be used.02-12-2009
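The lookup described above — search a data store for a speech profile and fall back to a default voice when none exists — can be sketched as follows. The profile fields and in-memory store are illustrative assumptions, not the patented system.

```python
# Minimal sketch: per-user TTS settings with a default-voice fallback.

DEFAULT_VOICE = {"voice": "default", "rate": 1.0, "pitch": 1.0}

PROFILES = {  # stand-in for the speech-profile data store
    "user42": {"voice": "warm_female", "rate": 0.9, "pitch": 1.1},
}

def select_voice(user_id):
    """Return TTS engine settings for this user, or the default voice."""
    profile = PROFILES.get(user_id)
    if profile is None:
        return dict(DEFAULT_VOICE)
    # Overlay the user's stored characteristics on the defaults.
    settings = dict(DEFAULT_VOICE)
    settings.update(profile)
    return settings

settings = select_voice("user42")
```

The returned settings would then be used to adjust the text-to-speech engine for the duration of the established communication.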
20090319273AUDIO CONTENT GENERATION SYSTEM, INFORMATION EXCHANGING SYSTEM, PROGRAM, AUDIO CONTENT GENERATING METHOD, AND INFORMATION EXCHANGING METHOD - An audio content generation system is a system for generating audio contents including a voice synthesis unit […]12-24-2009
20090313022SYSTEM AND METHOD FOR AUDIBLY OUTPUTTING TEXT MESSAGES - A method and system for audibly outputting text messages include: setting a vocalizing function for audibly outputting text messages, searching a character speech library for each character of a received text message, and acquiring pronunciation data of each character of the received text message. The method and the system further include vocalizing the pronunciation data of each character of the received text message, generating a voice message, and audibly outputting the generated voice message.12-17-2009
20090306987SINGING SYNTHESIS PARAMETER DATA ESTIMATION SYSTEM - There is provided a singing synthesis parameter data estimation system that automatically estimates singing synthesis parameter data for automatically synthesizing a human-like singing voice from an audio signal of input singing voice. A pitch parameter estimating section […]12-10-2009
20120191457METHODS AND APPARATUS FOR PREDICTING PROSODY IN SPEECH SYNTHESIS - Techniques for predicting prosody in speech synthesis may make use of a data set of example text fragments with corresponding aligned spoken audio. To predict prosody for synthesizing an input text, the input text may be compared with the data set of example text fragments to select a best matching sequence of one or more example text fragments, each example text fragment in the sequence being paired with a portion of the input text. The selected example text fragment sequence may be aligned with the input text, e.g., at the word level, such that prosody may be extracted from the audio aligned with the example text fragments, and the extracted prosody may be applied to the synthesis of the input text using the alignment between the input text and the example text fragments.07-26-2012
20130073288Wireless Server Based Text to Speech Email - An email system for mobile devices, such as cellular phones and PDAs, is disclosed which allows email messages to be played back on the mobile device as voice messages on demand by way of a media player, thus eliminating the need for a unified messaging system. Email messages are received by the mobile device in a known manner. In accordance with an important aspect of the invention, the email messages are identified by the mobile device as they are received. After the message is identified, the mobile device sends the email message in text format to a server for conversion to speech or voice format. After the message is converted to speech format, the server sends the messages back to the user's mobile device and notifies the user of the email message and then plays the message back to the user through a media player upon demand.03-21-2013
20130073287VOICE PRONUNCIATION FOR TEXT COMMUNICATION - A method, computer program product, and system for voice pronunciation for text communication is described. A selected portion of a text communication is determined. A prompt to record a pronunciation relating to the selected portion of the text communication is provided at a first computing device. The recorded pronunciation is associated with the selected portion of the text communication. A visual indicator, relating to the selected portion of the text communication and the recorded pronunciation, is displayed.03-21-2013
20110015930UNIFIED COMMUNICATION SYSTEM - A unified communication system is disclosed that allows a variety of end point types to participate in a communication event using a common, unified communication system. In some implementations, a calling party interacts with a client application residing on an endpoint to make a communication request to another endpoint. A communication event manager residing in the unified communication system selects a script from a repository of scripts based on the communication event and the capabilities of the endpoints. A communication event execution engine receives a user profile associated with at least one of the endpoints. The user profile can be configured by the user to describe the user's preferences for how the communication should be processed by the unified communication system.01-20-2011
20110015929TRANSFORMING A TACTUALLY SELECTED USER INPUT INTO AN AUDIO OUTPUT - A contextual input device includes a plurality of tactually discernable keys disposed in a predetermined configuration which replicates a particular relationship among a plurality of items associated with a known physical object. The tactually discernable keys are typically labeled with Braille type. The known physical object is typically a collection of related items grouped together by some common relationship. A computer-implemented process determines whether an input signal represents a selection of an item from among a plurality of items or an attribute pertaining to an item among the plurality of items. Once the selected item or attribute pertaining to an item is determined, the computer-implemented process transforms a user's selection from the input signal into an analog audio signal which is then audibly output as human speech with an electro-acoustic transducer.01-20-2011
20090271202SPEECH SYNTHESIS APPARATUS, SPEECH SYNTHESIS METHOD, SPEECH SYNTHESIS PROGRAM, PORTABLE INFORMATION TERMINAL, AND SPEECH SYNTHESIS SYSTEM - A speech synthesis apparatus includes a content selection unit that selects a text content item to be converted into speech; a related information selection unit that selects related information which can be at least converted into text and which is related to the text content item selected by the content selection unit; a data addition unit that converts the related information selected by the related information selection unit into text and adds text data of the text to text data of the text content item selected by the content selection unit; a text-to-speech conversion unit that converts the text data supplied from the data addition unit into a speech signal; and a speech output unit that outputs the speech signal supplied from the text-to-speech conversion unit.10-29-2009
20120226501Document Navigation Method - A document navigation tool that automatically navigates a document based on previous input from the user. The document navigation tool is utilized each time a page loads. The method recognizes user behavior on pages using patterns, which are based on four criteria: location, frequency, consistency, and scope. If the user has visited the page previously and has established a pattern, the method automatically focuses on the portion of the page indicated by the pattern, e.g. the location on a web page of the link clicked by the user in the user's last three visits to the page. If the user has not visited the page previously, the method logs the events that occur during this visit to the page.09-06-2012
20120226500SYSTEM AND METHOD FOR CONTENT RENDERING INCLUDING SYNTHETIC NARRATION - A system and method for capturing a voice information and using the voice information to modulate a content output signal. The method for capturing voice information includes receiving a request to create speech modulation and presenting a piece of textual content operable for use in creating the speech modulation based on the textual input. The method further includes receiving a first voice sample and determining a voice fingerprint based on said first voice sample. The voice fingerprint is operable for modulating speech during content rendering (e.g., audio output) such that a synthetic narration is performed based on the textual input. The voice fingerprint may then be stored and used for modulating the output.09-06-2012
20130066632SYSTEM AND METHOD FOR ENRICHING TEXT-TO-SPEECH SYNTHESIS WITH AUTOMATIC DIALOG ACT TAGS - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for modifying the prosody of synthesized speech based on an associated speech act. A system configured according to the method embodiment (1) receives text, (2) performs an analysis of the text to determine and assign a speech act label to the text, and (3) converts the text to speech, where the speech prosody is based on the speech act label. The analysis compares the text to a corpus of previously tagged utterances to find a close match, determines a confidence score from a correlation of the text and the close match, and, if the confidence score is above a threshold value, retrieves the speech act label of the close match and assigns it to the text.03-14-2013
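The tagging flow in entry 20130066632 (corpus match, confidence score, threshold) can be sketched as follows. The tagged corpus, the default label, and the token-overlap (Jaccard) similarity are illustrative assumptions; the abstract does not specify the correlation measure.

```python
# Hypothetical sketch: assign a dialog-act label from the closest tagged
# utterance, but only when the similarity clears a confidence threshold.

TAGGED_CORPUS = [
    ("what time is it", "question"),
    ("please close the door", "request"),
    ("that is wonderful news", "exclamation"),
]

def jaccard(a, b):
    """Token-set overlap between two utterances (assumed similarity metric)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def tag_speech_act(text, threshold=0.5, default="statement"):
    best_label, best_score = default, 0.0
    for utterance, label in TAGGED_CORPUS:
        score = jaccard(text.lower(), utterance)
        if score > best_score:
            best_label, best_score = label, score
    # Only trust the close match's label when confidence exceeds the threshold.
    return best_label if best_score >= threshold else default

print(tag_speech_act("what time is it now"))   # close match -> "question"
```

The chosen label would then drive prosody selection (e.g., rising intonation for "question") in the synthesis step.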
20090012793TEXT-TO-SPEECH ASSIST FOR PORTABLE COMMUNICATION DEVICES - The present invention provides a text-to-speech assist for portable communication devices. A method for communicating text data using a portable communication device in accordance with the present invention includes: displaying text data on a display of the portable communication device while communicating with a party; selecting at least a portion of the displayed text data; converting the selected text data into synthesized speech; and providing the synthesized speech to the party using the portable communication device.01-08-2009
20100088099Reducing Processing Latency in Optical Character Recognition for Portable Reading Machine - A portable reading device includes a computing device and a computer readable medium storing a computer program product to receive an image and select a section of the image to process. The product processes the section of the image with a first process and, when the first process is finished processing the section of the image, processes a result of the first process with a second process. While the second process is processing, the product repeats the first process on another section of the image.04-08-2010
20130166304SYNCHRONISE AN AUDIO CURSOR AND A TEXT CURSOR DURING EDITING - A speech recognition device (06-27-2013
20080294443APPLICATION OF EMOTION-BASED INTONATION AND PROSODY TO SPEECH IN TEXT-TO-SPEECH SYSTEMS - A text-to-speech system that includes an arrangement for accepting text input, an arrangement for providing synthetic speech output, and an arrangement for imparting emotion-based features to synthetic speech output. The arrangement for imparting emotion-based features includes an arrangement for accepting instruction for imparting at least one emotion-based paradigm to synthetic speech output, as well as an arrangement for applying at least one emotion-based paradigm to synthetic speech output.11-27-2008
20090018839Personal Virtual Assistant - A computer-based virtual assistant includes a virtual assistant application running on a computer capable of receiving human voice communications from a user of a remote user interface and transmitting a vocalization to the remote user interface, the virtual assistant application enabling the user to access email and voicemail messages of the user, the virtual assistant application selecting a responsive action to a verbal query or instruction received from the remote user interface and transmitting a vocalization characterizing the selected responsive action to the remote user interface, and the virtual assistant waiting a predetermined period of time, and if no canceling indication is received from the remote user interface, proceeding to perform the selected responsive action, and if a canceling indication is received from the remote user interface halting the selected responsive action and transmitting a new vocalization to the remote user interface. Also a method of using the virtual assistant.01-15-2009
20100004933VOICE DIRECTED SYSTEM AND METHOD CONFIGURED FOR ASSURED MESSAGING TO MULTIPLE RECIPIENTS - A communications system transmits messages via a wireless network to multiple users nearly simultaneously in real-time. Each user has a terminal that receives a message and plays the message for the user. The terminal may also wait for the user to verbally acknowledge the arrival of the message before continuing with its normally executing application. The sender of the message may track, for each intended recipient, the delivery of the message, the accessing of the message by the user, and the acknowledgement by the user that the message was understood.01-07-2010
20110282668SPEECH ADAPTATION IN SPEECH SYNTHESIS - A method of and system for speech synthesis. First and second text inputs are received in a text-to-speech system, and processed into respective first and second speech outputs corresponding to stored speech respectively from first and second speakers using a processor of the system. The second speech output of the second speaker is adapted to sound like the first speech output of the first speaker.11-17-2011
20090063154EMOTIVE TEXT-TO-SPEECH SYSTEM AND METHOD - Information about a device may be emotively conveyed to a user of the device. Input indicative of an operating state of the device may be received. The input may be transformed into data representing a simulated emotional state. Data representing an avatar that expresses the simulated emotional state may be generated and displayed. A query from the user regarding the simulated emotional state expressed by the avatar may be received. The query may be responded to.03-05-2009
20100125459STOCHASTIC PHONEME AND ACCENT GENERATION USING ACCENT CLASS - Exemplary embodiments provide for determining a sequence of words in a TTS system. An input text is analyzed using two models, a word n-gram model and an accent class n-gram model. A list of all possible words for each word in the input is generated for each model. Each word in each list for each model is given a score based on the probability that the word is the correct word in the sequence, based on the particular model. The two lists are combined and the two scores are combined for each word. A set of sequences of words are generated. Each sequence of words comprises a unique combination of an attribute and associated word for each word in the input. The combined score of each of word in the sequence of words is combined. A sequence of words having the highest score is selected and presented to a user.05-20-2010
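The two-model scoring in entry 20100125459 can be sketched as follows: each candidate reading of a word receives one score from the word n-gram model and one from the accent-class n-gram model, and the combined score ranks the candidates. Log-probability addition and the example candidates are assumptions; the abstract only states that the two scores are combined.

```python
# Hypothetical sketch: combine per-candidate scores from two n-gram models
# and pick the highest-scoring candidate.
import math

def combine_scores(word_model_probs, accent_model_probs):
    """Merge two {candidate: probability} dicts into combined log scores."""
    combined = {}
    for cand, p1 in word_model_probs.items():
        p2 = accent_model_probs.get(cand, 1e-9)  # floor for unseen candidates
        combined[cand] = math.log(p1) + math.log(p2)
    return combined

def best_candidate(word_model_probs, accent_model_probs):
    combined = combine_scores(word_model_probs, accent_model_probs)
    return max(combined, key=combined.get)

# Two hypothetical readings of one ambiguous input word:
word_probs = {"hashi(bridge)": 0.6, "hashi(chopsticks)": 0.4}
accent_probs = {"hashi(bridge)": 0.2, "hashi(chopsticks)": 0.7}
print(best_candidate(word_probs, accent_probs))   # "hashi(chopsticks)"
```

A full implementation would score whole sequences rather than isolated words, but the per-word combination step is the same.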
20090138268DATA PROCESSING DEVICE AND COMPUTER-READABLE STORAGE MEDIUM STORING SET OF PROGRAM INSTRUCTIONS EXECUTABLE ON DATA PROCESSING DEVICE - A data processing device includes a displaying unit, a receiving unit, a determining unit, and a controlling unit. The displaying unit displays one of a first operation screen and a second operation screen. Input data is inputted into the receiving unit by a user. The determining unit determines, based on at least one of the input data and settings of an OS, which of the first operation screen and the second operation screen should be displayed on the displaying unit. The controlling unit controls the displaying unit to display the first operation screen if the determining unit determines that the first operation screen should be displayed on the displaying unit, and controls the displaying unit to display the second operation screen if the determining unit determines that the second operation screen should be displayed on the displaying unit.05-28-2009
20110295606CONTEXTUAL CONVERSION PLATFORM - A contextual conversion platform, and method for converting text-to-speech, are described that can convert content of a target to spoken content. Embodiments of the contextual conversion platform can identify certain contextual characteristics of the content, from which can be generated a spoken content input. This spoken content input can include tokens, e.g., words and abbreviations, to be converted to the spoken content, as well as substitution tokens that are selected from contextual repositories based on the context identified by the contextual conversion platform.12-01-2011
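The substitution-token step in entry 20110295606 can be sketched as follows: tokens of the content are replaced with spoken forms drawn from a repository selected by the identified context. The repository contents and the "medical" context are invented for illustration.

```python
# Hypothetical sketch: expand abbreviations into spoken-form substitution
# tokens using a context-specific repository.

REPOSITORIES = {
    "medical": {"mg": "milligrams", "bp": "blood pressure"},
    "finance": {"q3": "third quarter", "yoy": "year over year"},
}

def to_spoken_tokens(content, context):
    """Replace tokens found in the context's repository; pass others through."""
    repo = REPOSITORIES.get(context, {})
    return [repo.get(tok.lower(), tok) for tok in content.split()]

print(to_spoken_tokens("take 5 mg twice daily", "medical"))
```

The same abbreviation could expand differently under another context key, which is the point of keeping per-context repositories.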
20090043584System and method for phonetic representation - A method for generating an Approximate Phonetic Representation (APR) of a given word, the word having a sequence of characters, the method comprising: Receiving the word; Generating the APR by applying at least one metaphone3 translation rule to encode one or more of the characters of the given word into a resulting APR; and Returning either the generated APR and/or one or more words matching the APR from a dictionary of words.02-12-2009
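The encode-then-match flow of entry 20090043584 can be illustrated with a deliberately simplified stand-in. The rules below are NOT the metaphone3 rule set (which the patent references but this listing does not reproduce); they only show how a rule-based approximate phonetic key is produced and used for dictionary matching.

```python
# Hypothetical sketch: a few character-level translation rules produce an
# approximate phonetic key; dictionary words with the same key "match".

RULES = [("ph", "F"), ("ck", "K"), ("sh", "X")]   # toy rules, not metaphone3
VOWELS = "aeiou"

def approximate_phonetic(word):
    w = word.lower()
    for pattern, replacement in RULES:
        w = w.replace(pattern, replacement)
    # Keep the initial character, drop non-initial vowels, as many
    # phonetic encoders do.
    return w[0].upper() + "".join(c.upper() for c in w[1:] if c not in VOWELS)

def matches(word, dictionary):
    key = approximate_phonetic(word)
    return [d for d in dictionary if approximate_phonetic(d) == key]

print(approximate_phonetic("phone"))                 # "FN"
print(matches("fone", ["phone", "fun", "phoney"]))
```

Note that approximate keys deliberately collide across spellings, which is what makes misspelling-tolerant lookup possible.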
20100145703Portable Code Recognition Voice-Outputting Device - The present invention relates to a code recognition voice-outputting device, in which a digital code image of a predetermined compression type is recognized, and the recognized image is converted into voice to be output to the outside. The apparatus includes a reader as a scanning unit for recognizing a compressed digital code image, and a player for processing the digital code image read from the reader, and converting the processed code image into voice to be output to the outside, wherein the reader and the player are configured to be capable of being separated from each other. The present invention further provides a code recognition voice-outputting device which supports a variety of functions and provides a voice guide function for all menus and operating statuses that support the functions for the sake of the visually impaired, the illiterate, the aged, etc., thereby promoting user convenience.06-10-2010
20090177473APPLYING VOCAL CHARACTERISTICS FROM A TARGET SPEAKER TO A SOURCE SPEAKER FOR SYNTHETIC SPEECH - A computer implemented method, system and computer usable program code for synthesizing speech. A computer implemented method for synthesizing speech includes providing a database of speech of a source speaker, and providing a prosody model of speech of a target speaker different from the source speaker. Text input to be synthesized is received, and the prosody model of speech of the target speaker is applied to the text input to select segments of the speech of the source speaker in the database to form synthesized speech of the text input. The synthesized speech of the text input is then output.07-09-2009
20090018837SPEECH PROCESSING APPARATUS AND METHOD - A speech processing apparatus which can play back a sentence using recorded-speech-playback or text-to-speech is provided. It is determined whether each of a plurality of words or phrases constituting a sentence is a word or phrase to be played back by recorded-speech-playback or a word or phrase to be played back by text-to-speech. When each of the plurality of words or phrases is to be played back in a first sequence using the determined synthesis method, it is selected whether to play back each of the plurality of words or phrases in the first sequence or a sequence different from the first sequence, based on the number of alternations between playback using recorded-speech-playback and playback using text-to-speech. Each of the plurality of words or phrases is played back in the selected sequence using the selected synthesis method.01-15-2009
20090319274System and Method for Verifying Origin of Input Through Spoken Language Analysis - An audible based electronic challenge system is used to control access to a computing resource by using a test to identify an origin of a voice. The test is based on analyzing a spoken utterance to determine if it was articulated by an unauthorized human or a text to speech (TTS) system.12-24-2009
20100010815FACILITATING TEXT-TO-SPEECH CONVERSION OF A DOMAIN NAME OR A NETWORK ADDRESS CONTAINING A DOMAIN NAME - To facilitate text-to-speech conversion of a username, a first or last name of a user associated with the username may be retrieved, and a pronunciation of the username may be determined based at least in part on whether the name forms at least part of the username. To facilitate text-to-speech conversion of a domain name having a top level domain and at least one other level domain, a pronunciation for the top level domain may be determined based at least in part upon whether the top level domain is one of a predetermined set of top level domains. Each other level domain may be searched for one or more recognized words therewithin, and a pronunciation of the other level domain may be determined based at least in part on an outcome of the search. The username and domain name may form part of a network address such as an email address, URL or URI.01-14-2010
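The domain-name rules in entry 20100010815 can be sketched as follows: a predetermined set of top-level domains gets fixed pronunciations, and other labels are spoken as words when recognized or spelled out otherwise. The TLD set and word list below are assumptions; the patent does not enumerate them.

```python
# Hypothetical sketch: choose between speaking, spelling, and fixed TLD
# pronunciations for each label of a domain name.

KNOWN_TLDS = {"com": "dot com", "org": "dot org", "net": "dot net"}
WORDS = {"open", "source", "mail"}

def pronounce_label(label):
    """Speak a label as a word if recognized, else spell it letter by letter."""
    if label in WORDS:
        return label
    return " ".join(label)

def pronounce_domain(domain):
    labels = domain.lower().split(".")
    tld = labels[-1]
    spoken = [pronounce_label(l) for l in labels[:-1]]
    # Predetermined TLDs get a fixed pronunciation; unknown TLDs are spelled.
    spoken.append(KNOWN_TLDS.get(tld, "dot " + " ".join(tld)))
    return " ".join(spoken)

print(pronounce_domain("open.source.com"))   # "open source dot com"
```

The username half of the patent's method would similarly test whether a retrieved first or last name forms part of the username before deciding to speak or spell it.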
20100268539SYSTEM AND METHOD FOR DISTRIBUTED TEXT-TO-SPEECH SYNTHESIS AND INTELLIGIBILITY - A method and system for distributed text-to-speech synthesis and intelligibility, and more particularly to distributed text-to-speech synthesis on handheld portable computing devices that can be used for example to generate intelligible audio prompts that help a user interact with a user interface of the handheld portable computing device. The text-to-speech distributed system 10-21-2010
20090006097Pronunciation correction of text-to-speech systems between different spoken languages - Pronunciation correction for text-to-speech (TTS) systems and speech recognition (SR) systems between different languages is provided. If a word requiring pronunciation by a target language TTS or SR is from a same language as the target language, but is not found in a lexicon of words from the target language, a letter-to-speech (LTS) rules set of the target language is used to generate a letter-to-speech output for the word for use by the TTS or SR configured according to the target language. If the word is from a different language as the target language, phonemes comprising the word according to its native language are mapped to phonemes of the target language. The phoneme mapping is used by the TTS or SR configured according to the target language for generating or recognizing an audible form of the word according to the target language.01-01-2009
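The three-way fallback in entry 20090006097 can be sketched as follows: use the target lexicon if the word is listed, apply the target language's LTS rules if the word is native but unlisted, and map the word's native phonemes onto target phonemes when it comes from a different language. All table contents and the naive LTS stand-in are illustrative assumptions.

```python
# Hypothetical sketch of cross-language pronunciation fallback.

TARGET_LEXICON = {"hello": ["HH", "AH", "L", "OW"]}
PHONEME_MAP = {"R_fr": "R", "U_fr": "UW"}   # foreign -> target phoneme mapping

def target_lts(word):
    """Naive placeholder for the target language's letter-to-sound rules."""
    return [ch.upper() for ch in word]

def pronounce(word, native_phonemes=None):
    if word in TARGET_LEXICON:
        return TARGET_LEXICON[word]
    if native_phonemes is None:
        # Same language as the target, but not in the lexicon: use LTS rules.
        return target_lts(word)
    # Different source language: map each native phoneme to the target set.
    return [PHONEME_MAP.get(p, p) for p in native_phonemes]

print(pronounce("rue", native_phonemes=["R_fr", "U_fr"]))   # ['R', 'UW']
```

The mapped phoneme string is what the target-language TTS or recognizer would then consume.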
20120109655WIRELESS SERVER BASED TEXT TO SPEECH EMAIL - An email system for mobile devices, such as cellular phones and PDAs, is disclosed which allows email messages to be played back on the mobile device as voice messages on demand by way of a media player, thus eliminating the need for a unified messaging system. Email messages are received by the mobile device in a known manner. In accordance with an important aspect of the invention, the email messages are identified by the mobile device as they are received. After the message is identified, the mobile device sends the email message in text format to a server for conversion to speech or voice format. After the message is converted to speech format, the server sends the messages back to the user's mobile device and notifies the user of the email message and then plays the message back to the user through a media player upon demand.05-03-2012
20120109654METHODS AND APPARATUSES FOR FACILITATING SPEECH SYNTHESIS - Methods and apparatuses are provided for facilitating speech synthesis. A method may include generating a plurality of input models representing an input by using a statistical model synthesizer to statistically model the input. The method may further include determining a speech unit sequence representing at least a portion of the input by using the input models to influence selection of one or more pre-recorded speech units having parameter representations. The method may additionally include identifying one or more bad units in the unit sequence. The method may also include replacing the identified one or more bad units with one or more parameters generated by the statistical model synthesizer. Corresponding apparatuses are also provided.05-03-2012
20120035934SPEECH GENERATION DEVICE WITH A PROJECTED DISPLAY AND OPTICAL INPUTS - In several embodiments, a speech generation device is disclosed. The speech generation device may generally include a projector configured to project images in the form of a projected display onto a projection Surface, an optical input device configured to detect an input directed towards the projected display and a speaker configured to generate an audio output. In addition, the speech generation device may include a processing unit communicatively coupled to the projector, the optical input device and the speaker. The processing unit may include a processor and related computer readable medium configured to store instructions executable by the processor, wherein the instructions stored on the computer readable medium configure the speech generation device to generate text-to-speech output.02-09-2012
20120035933SYSTEM AND METHOD FOR SYNTHETIC VOICE GENERATION AND MODIFICATION - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating a synthetic voice. A system configured to practice the method combines a first database of a first text-to-speech voice and a second database of a second text-to-speech voice to generate a combined database, selects from the combined database, based on a policy, voice units of a phonetic category for the synthetic voice to yield selected voice units, and synthesizes speech based on the selected voice units. The system can synthesize speech without parameterizing the first text-to-speech voice and the second text-to-speech voice. A policy can define, for a particular phonetic category, from which text-to-speech voice to select voice units. The combined database can include multiple text-to-speech voices from different speakers. The combined database can include voices of a single speaker speaking in different styles. The combined database can include voices of different languages.02-09-2012
20100082345SPEECH AND TEXT DRIVEN HMM-BASED BODY ANIMATION SYNTHESIS - An “Animation Synthesizer” uses trainable probabilistic models, such as Hidden Markov Models (HMM), Artificial Neural Networks (ANN), etc., to provide speech and text driven body animation synthesis. Probabilistic models are trained using synchronized motion and speech inputs (e.g., live or recorded audio/video feeds) at various speech levels, such as sentences, phrases, words, phonemes, sub-phonemes, etc., depending upon the available data, and the motion type or body part being modeled. The Animation Synthesizer then uses the trainable probabilistic model for selecting animation trajectories for one or more different body parts (e.g., face, head, hands, arms, etc.) based on an arbitrary text and/or speech input. These animation trajectories are then used to synthesize a sequence of animations for digital avatars, cartoon characters, computer generated anthropomorphic persons or creatures, actual motions for physical robots, etc., that are synchronized with a speech output corresponding to the text and/or speech input.04-01-2010
20100100385System and Method for Testing a TTS Voice - Disclosed are various elements of a toolkit used for generating a TTS voice for use in a spoken dialog system. The invention in each case may be in the form of the system, a computer-readable medium or a method for generating the TTS voice. An embodiment of the invention relates to a method for preparing a text-to-speech (TTS) voice for testing and verification. The method comprises processing a TTS voice to be ready for testing, synthesizing words utilizing the TTS voice, presenting to a person the smallest possible subset that contains at least N instances of a group of units in the TTS voice, receiving information from the person associated with corrections needed to the TTS voice and making corrections to the TTS voice according to the received information.04-22-2010
20090119108AUDIO-BOOK PLAYBACK METHOD AND APPARATUS - An audio-book playback method includes buffering text data that is to be played back by speech, converting the buffered text data to speech data, performing speech-playback by using the speech data, and buffering next text data for continuous playback. The provided audio-book playback method and an apparatus enable a user to enjoy reading a book while also listening to content of the book being voiced by a multimedia playback device. Moreover, double buffering technology is employed to provide seamless text and speech-playback services.05-07-2009
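The double-buffering flow in entry 20090119108 can be sketched as follows: while one buffer of converted speech is playing, the next block of text is converted into the other buffer, so playback never stalls between blocks. The synthesis and playback calls are placeholders, not a real TTS API.

```python
# Hypothetical sketch of double-buffered text-to-speech playback.

def synthesize(text):
    return f"<audio:{text}>"      # placeholder for text-to-speech conversion

def play(audio, log):
    log.append(audio)             # placeholder for audio output

def play_book(pages):
    log = []
    # Pre-fill the first buffer before playback starts.
    next_buffer = synthesize(pages[0]) if pages else None
    for i in range(len(pages)):
        current = next_buffer
        # Convert the following page while the current one "plays".
        next_buffer = synthesize(pages[i + 1]) if i + 1 < len(pages) else None
        play(current, log)
    return log

print(play_book(["page one text", "page two text"]))
```

In a real device the synthesis of the next buffer and the playback of the current one would run concurrently; the sequential sketch only shows the buffer hand-off.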
20120296654SYSTEMS AND METHODS FOR DYNAMICALLY IMPROVING USER INTELLIGIBILITY OF SYNTHESIZED SPEECH IN A WORK ENVIRONMENT - Method and apparatus that dynamically adjusts operational parameters of a text-to-speech engine in a speech-based system. A voice engine or other application of a device provides a mechanism to alter the adjustable operational parameters of the text-to-speech engine. In response to one or more environmental conditions, the adjustable operational parameters of the text-to-speech engine are modified to increase the intelligibility of synthesized speech.11-22-2012
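The environment-driven adjustment in entry 20120296654 can be sketched as follows: a measured condition (here, ambient noise level) is mapped to changes in the TTS engine's adjustable operational parameters. The parameter names and thresholds are assumptions for illustration.

```python
# Hypothetical sketch: modify TTS parameters in response to measured noise.

def adjust_tts_parameters(params, noise_db):
    """Return a copy of params adapted to the ambient noise level (dB)."""
    adjusted = dict(params)
    if noise_db > 70:              # loud environment: louder and slower
        adjusted["volume"] = min(1.0, params["volume"] + 0.3)
        adjusted["rate"] = max(0.5, params["rate"] - 0.2)
    elif noise_db < 40:            # quiet environment: back off the volume
        adjusted["volume"] = max(0.2, params["volume"] - 0.1)
    return adjusted

base = {"volume": 0.6, "rate": 1.0, "pitch": 1.0}
print(adjust_tts_parameters(base, noise_db=80))
```

A deployed system would re-run this mapping as conditions change, feeding the adjusted parameters back into the running text-to-speech engine.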
20090157408SPEECH SYNTHESIZING METHOD AND APPARATUS - The present invention relates to a speech synthesizing method and apparatus based on a hidden Markov model (HMM). Among code words that are obtained by quantizing speech parameter instances for each state of an HMM model, a code word closest to a speech parameter generated from an input text using a known method is searched. When the distance between the searched code word and the speech parameter generated by the known method is smaller to or equal to a threshold value, the searched code word is output as a final speech parameter. When the distance exceeds the threshold value, the speech parameter generated by the known method is output as the final speech parameter. The final speech parameter is processed to generate final synthesized speech for the input text.06-18-2009
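The code-word selection in entry 20090157408 can be sketched as follows: find the quantized parameter vector nearest the HMM-generated parameter, and fall back to the generated parameter when the distance exceeds a threshold. Euclidean distance and the codebook values are assumptions for illustration.

```python
# Hypothetical sketch: nearest-code-word lookup with a distance threshold.
import math

CODEBOOK = [(1.0, 2.0), (4.0, 0.5), (0.0, 0.0)]  # quantized speech parameters

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_parameter(generated, threshold=1.0):
    nearest = min(CODEBOOK, key=lambda c: dist(c, generated))
    # Use the code word only when it is close enough to the generated
    # parameter; otherwise output the generated parameter itself.
    return nearest if dist(nearest, generated) <= threshold else generated

print(select_parameter((1.2, 2.1)))   # snaps to (1.0, 2.0)
print(select_parameter((9.0, 9.0)))   # too far: falls back to the input
```

The threshold trades off the naturalness of real quantized instances against fidelity to the model's own output.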
20080249776Methods and Arrangements for Enhancing Machine Processable Text Information - The invention relates to methods and arrangements for enhancing machine processable text information which is provided by at least machine processable text data. On the basis of synthetic speech, i.e. speech generated by a machine, prosody-related information and/or text-related information is determined and added to given text information.10-09-2008
20080288257APPLICATION OF EMOTION-BASED INTONATION AND PROSODY TO SPEECH IN TEXT-TO-SPEECH SYSTEMS - A text-to-speech system that includes an arrangement for accepting text input, an arrangement for providing synthetic speech output, and an arrangement for imparting emotion-based features to synthetic speech output. The arrangement for imparting emotion-based features includes an arrangement for accepting instruction for imparting at least one emotion-based paradigm to synthetic speech output, as well as an arrangement for applying at least one emotion-based paradigm to synthetic speech output.11-20-2008
20090265172INTEGRATED SYSTEM AND METHOD FOR MOBILE AUDIO PLAYBACK AND DICTATION - A method and system provides for a single-pass review and feedback of a document. During audio playback of the document to be reviewed, voice-activated recording of feedback and submission of feedback relative to the location in the original document are accomplished. This provides for a fully integrated, single pass review and feedback of documentation to occur.10-22-2009
20080312931SPEECH SYNTHESIS METHOD, SPEECH SYNTHESIS SYSTEM, AND SPEECH SYNTHESIS PROGRAM - A speech synthesis system stores a group of speech units in a memory, selects a plurality of speech units from the group based on prosodic information of target speech, the selected speech units corresponding to each of the segments which are obtained by segmenting a phoneme string of the target speech and minimizing distortion, relative to the target speech, of synthetic speech generated from the selected speech units, generates a new speech unit corresponding to each of the segments, by fusing the speech units selected, to obtain a plurality of new speech units corresponding to the segments respectively, and generates synthetic speech by concatenating the new speech units.12-18-2008
20080312930METHOD AND SYSTEM FOR ALIGNING NATURAL AND SYNTHETIC VIDEO TO SPEECH SYNTHESIS - According to MPEG-12-18-2008
20080243511Speech synthesizer - The present invention is a speech synthesizer that generates speech data of text including a fixed part and a variable part, in combination with recorded speech and rule-based synthetic speech. The speech synthesizer is a high-quality one in which recorded speech and synthetic speech are concatenated without perceptible discontinuity in timbre and prosody. The speech synthesizer includes: a recorded speech database that previously stores recorded speech data including a recorded fixed part; a rule-based synthesizer that generates rule-based synthetic speech data including a variable part and at least part of the fixed part, from received text; a concatenation boundary calculator that calculates a concatenation boundary position in a region in which the recorded speech data and the rule-based synthetic speech data overlap, based on acoustic characteristics of the recorded speech data and the rule-based synthetic speech data that correspond to the text; and a concatenative synthesizer that generates synthetic speech data corresponding to the text by concatenating the recorded speech data and the rule-based synthetic speech data that are segmented in the concatenation boundary position.10-02-2008
20080270138AUDIO CONTENT SEARCH ENGINE - A method of generating an audio content index for use by a search engine includes determining a phoneme sequence based on recognized speech from an audio content time segment. The method also includes identifying k-phonemes which occur within the phoneme sequence. The identified k-phonemes are stored within a data structure such that the identified k-phonemes are capable of being compared with k-phonemes from a search query.10-30-2008
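The k-phoneme indexing in entry 20080270138 can be sketched as follows: slide a window of k phonemes over the recognized phoneme sequence of an audio segment and record where each k-gram starts, so a query's k-grams can be looked up directly. The value k=3 and the ARPAbet-style phoneme strings are illustrative assumptions.

```python
# Hypothetical sketch: build a k-phoneme occurrence index for audio search.

def build_k_phoneme_index(phonemes, k=3):
    """Map each k-phoneme tuple to the positions where it starts."""
    index = {}
    for i in range(len(phonemes) - k + 1):
        index.setdefault(tuple(phonemes[i:i + k]), []).append(i)
    return index

segment = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
index = build_k_phoneme_index(segment, k=3)
print(index[("AH", "L", "OW")])   # start positions of this 3-phoneme run
```

At query time, the search engine would convert the query to phonemes the same way and intersect the position lists of its k-grams.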
20080312929USING FINITE STATE GRAMMARS TO VARY OUTPUT GENERATED BY A TEXT-TO-SPEECH SYSTEM - The present invention discloses a text-to-speech system that provides output variability. The system can include a finite state grammar, a variability engine and a text-to-speech engine. The finite state grammar can contain a phrase rule consisting of one or more phrase elements. The phrase rule can deterministically generate a variable text phrase based upon at least one random number. The phrase rule can include a definition for each of the phrase elements. Each definition can be associated with at least one defined text string. The variability engine can construct a random text phrase responsive to receiving an action command, wherein said finite state grammar is used to create the text phrase. The variability engine can also rely on user-specified weights to adjust the output probabilities. The text-to-speech engine can convert the text phrase generated by the variability engine into speech output.12-18-2008
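The variability mechanism in entry 20080312929 can be sketched as follows: a phrase rule lists elements, each element carries weighted text alternatives, and a random draw picks one alternative per element. The grammar content and weights are invented for illustration.

```python
# Hypothetical sketch: weighted random phrase generation from a phrase rule.
import random

PHRASE_RULE = [
    # (alternatives, user-specified weights) for each phrase element
    (["Hello", "Hi", "Greetings"], [0.5, 0.3, 0.2]),
    (["how can I help", "what can I do for you"], [0.6, 0.4]),
]

def generate_phrase(rule, rng):
    """Draw one weighted alternative per phrase element and join them."""
    parts = [rng.choices(alternatives, weights=weights, k=1)[0]
             for alternatives, weights in rule]
    return ", ".join(parts) + "?"

rng = random.Random(42)  # seeded so the draw is repeatable
print(generate_phrase(PHRASE_RULE, rng))
```

The generated text phrase would then be handed to the text-to-speech engine, so repeated prompts sound varied rather than canned.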
20080270137TEXT TO SPEECH INTERACTIVE VOICE RESPONSE SYSTEM - A text to speech interactive voice response system is operable within a personal computer having a processor, data storage means and an operating system. The system comprises an input subsystem for receiving a text data stream from a source device in a predetermined format; a process control subsystem for converting the text data stream into corresponding output data items; an audio record subsystem for recording audio data to be associated with each output data item; and, a broadcast control subsystem for generating an audio broadcast based on the output data items. There is also disclosed a system management and control subsystem for user interface with the system.10-30-2008
20080270139Converting text-to-speech and adjusting corpus - The present invention provides a method and apparatus for text to speech conversion, and a method and apparatus for adjusting a corpus. The method for text to speech comprises: text analysis step for parsing the text to obtain descriptive prosody annotations of the text based on a TTS model generated from a first corpus; prosody parameter prediction step for predicting the prosody parameter of the text according to the result of text analysis step; speech synthesis step for synthesizing speech of said text based on said the prosody parameter of the text; wherein descriptive prosody annotations of the text include prosody structure for the text, the prosody structure of the text is adjusted according to a target speech speed for the synthesized speech. The present invention adjusts the prosody structure of the text according to the target speech speed. The synthesized speech will have improved quality.10-30-2008
20110270613INFERRING SWITCHING CONDITIONS FOR SWITCHING BETWEEN MODALITIES IN A SPEECH APPLICATION ENVIRONMENT EXTENDED FOR INTERACTIVE TEXT EXCHANGES - The disclosed solution includes a method for dynamically switching modalities based upon inferred conditions in a dialogue session involving a speech application. The method establishes a dialogue session between a user and the speech application. During the dialogue session, the user interacts using an original modality and a second modality. The speech application interacts using a speech modality only. A set of conditions indicative of interaction problems using the original modality can be inferred. Responsive to the inferring step, the original modality can be changed to the second modality. A modality transition to the second modality can be transparent to the speech application and can occur without interrupting the dialogue session. The original modality and the second modality can be different modalities; one including a text exchange modality and another including a speech modality.11-03-2011
20100153115Human-Assisted Pronunciation Generation - Pronunciation generation may be provided. First, a pronunciation interface may be provided. The pronunciation interface may be configured to display a word and a plurality of alternatives corresponding to one of a plurality of parts of the word. The plurality of parts may comprise phonemes or syllables of the word. Next, pronunciation data may be received through the pronunciation interface. The pronunciation data may indicate one of the plurality of alternatives. Then a pronunciation of the word may be generated based upon the received pronunciation data. The pronunciation may correspond to the indicated one of the plurality of alternatives. In addition, the pronunciation data may indicate which one of the plurality of parts of the word is stressed. This stress indication may be received in response to a user sliding a user selectable element to indicate which one of the plurality of parts of the word is stressed.06-17-2010
20120197646Open Architecture For a Voice User Interface - A system and method for processing voice requests from a user for accessing information on a computerized network and delivering information from a script server and an audio server in the network in audio format. A voice user interface subsystem includes: a dialog engine that is operable to interpret user requests from the user input, communicate the requests to the script server and the audio server, and receive information from the script server and the audio server; a media telephony services (MTS) server, wherein the MTS server is operable to receive user input via a telephony system, and to transfer the user input to the dialog engine; and a broker coupled between the dialog engine and the MTS server. The broker establishes a session between the MTS server and the dialog engine and controls telephony functions with the telephony system.08-02-2012
20090177475SPEECH SYNTHESIS DEVICE, METHOD, AND PROGRAM - Even when a pitch cycle has a large fluctuation and the pitch cycle string changes abruptly, it is possible to suppress the effect of the pitch cycle fluctuation and generate high-quality synthesized speech. A speech synthesis device generates synthesized speech corresponding to an input text sentence according to an original speech waveform stored in an original speech waveform information storage unit.07-09-2009
20090177474SPEECH PROCESSING APPARATUS AND PROGRAM - A speech synthesizer includes a periodic component fusing unit and an aperiodic component fusing unit. For each segment, the periodic components and the aperiodic components of a plurality of speech units selected by a unit selector are fused by the periodic component fusing unit and the aperiodic component fusing unit, respectively. The speech synthesizer is further provided with an adder, which edits and concatenates the periodic components and the aperiodic components of the fused speech units to generate a speech waveform.07-09-2009
20090187408SPEECH INFORMATION PROCESSING APPARATUS AND METHOD - A temporary child set is generated, and an elastic ratio of an elastic section of a model pattern is calculated. A temporary typical pattern of the set is generated by combining the pattern belonging to the set with the model pattern having the elastic section expanded or contracted. A distortion between the temporary typical pattern of the set and the pattern belonging to the set is calculated, and the set is determined to be a child set when the distortion is below a threshold. A typical pattern, namely the temporary typical pattern of the child set, is stored with a classification rule, namely the classification item of the context of the pattern belonging to the child set.07-23-2009
20090187407System and methods for reporting - The present invention relates to a system and methods for preparing reports, such as medical reports. The system and methods advantageously can verbalize information, using speech synthesis (text-to-speech), to support a dialogue between a user and the reporting system during the course of the preparation of the report in order that the user can avoid inefficient visual distractions.07-23-2009
20090094035METHOD AND SYSTEM FOR PRESELECTION OF SUITABLE UNITS FOR CONCATENATIVE SPEECH - A system and method for improving the response time of text-to-speech synthesis utilizes “triphone contexts” (i.e., triplets comprising a central phoneme and its immediate context) as the basic unit, instead of performing phoneme-by-phoneme synthesis. The method comprises generating a triphone preselection cost database for use in speech synthesis, beginning with the selection of a triphone sequence.04-09-2009
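The "triphone context" unit the abstract describes, a central phoneme with its immediate left and right neighbors, can be sketched as follows. Padding the utterance edges with a `sil` (silence) marker is an assumption, not a detail from the patent:

```python
def triphone_contexts(phonemes, pad="sil"):
    """Split a phoneme sequence into triphone contexts: triplets of
    (left neighbor, central phoneme, right neighbor), padding the
    utterance edges with a silence marker."""
    seq = [pad] + list(phonemes) + [pad]
    return [(seq[i - 1], seq[i], seq[i + 1]) for i in range(1, len(seq) - 1)]

# One triplet per phoneme of "hello":
print(triphone_contexts(["h", "eh", "l", "ow"]))
```

Preselection costs can then be computed once per triplet and cached in a database keyed on these triples, instead of being recomputed phoneme by phoneme at synthesis time.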
20090144060System and Method for Generating a Web Podcast Service - Disclosed is a system and method for generating a web podcast interview that allows a single user to create his own multi-voice interview from his computer. The method allows the user to enter a set of questions from a text file using a text editor. (Answers may also be entered from a text file, although this is not the preferred embodiment.) For each question, the user may select one particular interviewer voice among a plurality of predefined interviewer voices, and by using a text-to-speech module in a text-to-speech server, each question is converted into an audio question having the selected interviewer voice. Then, the user preferably records answers to each audio question using a telephone. And a questions/answers sequence in a podcast compliant format is generated.06-04-2009
20130218566AUDIO HUMAN INTERACTIVE PROOF BASED ON TEXT-TO-SPEECH AND SEMANTICS - The text-to-speech audio HIP technique described herein in some embodiments uses different correlated or uncorrelated words or sentences generated via a text-to-speech engine as audio HIP challenges. The technique can apply different effects in the text-to-speech synthesizer speaking a sentence to be used as a HIP challenge string. The different effects can include, for example, spectral frequency warping; vowel duration warping; background addition; echo addition; and varying the time duration between words, among others. In some embodiments the technique varies the set of parameters to prevent Automated Speech Recognition tools from using previously used audio HIP challenges to learn a model which can then be used to recognize future audio HIP challenges generated by the technique. Additionally, in some embodiments the technique introduces the requirement of semantic understanding in HIP challenges.08-22-2013
20130218567APPARATUS FOR TEXT-TO-SPEECH DELIVERY AND METHOD THEREFOR - A method and apparatus for determining the manner in which a processor-enabled device should produce sounds from data is described. The device ideally synthesizes sounds digitally and reproduces pre-recorded sounds, together with an audible delivery thereof, and includes a memory in which is stored a database of a plurality of data, at least some of which is in the form of text-based indicators, and one or more pre-recorded sounds. The device is further capable of repeatedly determining one or more physical conditions, e.g. current GPS location, which is compared with one or more reference values provided in memory such that a positive result of the comparison gives rise to an event requiring a sound to be produced by the device.08-22-2013
20130218569TEXT-TO-SPEECH USER'S VOICE COOPERATIVE SERVER FOR INSTANT MESSAGING CLIENTS - A system and method to allow an author of an instant message to enable and control the production of audible speech to the recipient of the message. The voice of the author of the message is characterized into parameters compatible with a formant or articulatory text-to-speech engine such that upon receipt, the receiving client device can generate audible speech signals from the message text according to the characterization of the author's voice. Alternatively, the author can store samples of his or her actual voice in a server so that, upon transmission of a message by the author to a recipient, the server extracts the samples needed only to synthesize the words in the text message, and delivers those to the receiving client device so that they are used by a client-side concatenative text-to-speech engine to generate audible speech signals having a close likeness to the actual voice of the author.08-22-2013
20090326948Automated Generation of Audiobook with Multiple Voices and Sounds from Text - A method, system and computer-usable medium are disclosed for the transcoding of annotated text to speech and audio. Source text is parsed into spoken text passages and sound description passages. A speaker identity is determined for each spoken text passage and a sound element for each sound description passage. The speaker identities and sound elements are automatically referenced to a voice and sound effects schema. A voice effect is associated with each speaker identity and a sound effect with each sound element. Each spoken text passage is then annotated with the voice effect associated with its speaker identity and each sound description passage is annotated with the sound effect associated with its sound element. The resulting annotated spoken text and sound description passages are processed to generate output text operable to be transcoded to speech and audio.12-31-2009
20090006096Voice persona service for embedding text-to-speech features into software programs - Described is a voice persona service by which users convert text into speech waveforms, based on user-provided parameters and voice data from a service data store. The service may be remotely accessed, such as via the Internet. The user may provide text tagged with parameters, with the text sent to a text-to-speech engine along with base or custom voice data, and the resulting waveform morphed based on the tags. The user may also provide speech. Once created, a voice persona corresponding to the speech waveform may be persisted, exchanged, made public, shared and so forth. In one example, the voice persona service receives user input and parameters, and retrieves a base or custom voice that may be edited by the user via a morphing algorithm. The service outputs a waveform, such as a .wav file for embedding in a software program, and persists the voice persona corresponding to that waveform.01-01-2009
20080319753TECHNIQUE FOR TRAINING A PHONETIC DECISION TREE WITH LIMITED PHONETIC EXCEPTIONAL TERMS - The present invention discloses a method for training an exception-limited phonetic decision tree. An initial subset of data can be selected and used for creating an initial phonetic decision tree. Additional terms can then be incorporated into the subset. The enlarged subset can be used to evaluate the phonetic decision tree, with the results being categorized as either correctly or incorrectly phonetized. An exception-limited phonetic tree can be generated from the set of correctly phonetized terms. If the termination conditions for the method have not been met, the steps of the method can be repeated.12-25-2008
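The train/evaluate/retain loop this abstract outlines can be sketched with stand-ins for the real components. The toy lexicon, the majority-vote "tree" (one split on the first letter), and the termination condition below are all illustrative assumptions:

```python
from collections import Counter

def train(subset):
    """Toy 'phonetic decision tree': predict a word's first phoneme
    from its first letter by majority vote over the training subset."""
    votes = {}
    for word, phone in subset.items():
        votes.setdefault(word[0], Counter())[phone] += 1
    rule = {letter: c.most_common(1)[0][0] for letter, c in votes.items()}
    return lambda w: rule.get(w[0])

# Toy lexicon: word -> its first phoneme ('chef' is the irregular term).
TERMS = {"cat": "k", "cot": "k", "cup": "k", "chef": "sh"}

def build_exception_limited(terms, rounds=3):
    """Iterate: train, evaluate, keep only correctly phonetized terms,
    retrain. Returns the exception-limited model plus the terms it
    cannot phonetize (a sketch of the loop in the abstract)."""
    subset = dict(terms)
    model = train(subset)
    for _ in range(rounds):
        correct = {w: p for w, p in subset.items() if model(w) == p}
        if len(correct) == len(subset):  # termination condition met
            break
        subset = correct
        model = train(subset)
    exceptions = sorted(set(terms) - set(subset))
    return model, exceptions

model, exceptions = build_exception_limited(TERMS)
print(exceptions)  # -> ['chef']
```

The misphonetized term drops out of the training set, so the final tree models only the regular terms; the exceptions would be handled by a separate lookup.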
20090024393Speech synthesizer and speech synthesis system - A speech synthesizer conducts a dialogue among a plurality of synthesized speakers, including a self speaker and one or more partner speakers, by use of a voice profile table describing emotional characteristics of synthesized voices, a speaker database storing feature data for different types of speakers and/or different speaking tones, a speech synthesis engine that synthesizes speech from input text according to feature data fitting the voice profile assigned to each synthesized speaker, and a profile manager that updates the voice profiles according to the content of the spoken text. The voice profiles of partner speakers are initially derived from the voice profile of the self speaker. A synthesized dialogue can be set up simply by selecting the voice profile of the self speaker.01-22-2009
20100153116METHOD FOR STORING AND RETRIEVING VOICE FONTS - The present invention is a system for storing text-to-speech files which includes a means for storing a plurality of voice fonts wherein each voice font has associated therewith a universal voice identifier (UVI). The invention includes delivering a voice font to a receiver of a message containing text wherein the message contains the UVI and the receiver requests the voice font associated with the UVI from the means for storing.06-17-2010
20090055188PITCH PATTERN GENERATION METHOD AND APPARATUS THEREOF - The prosody control unit pattern generation module generates pitch patterns for respective prosody control units based on language attribute information, phoneme duration, and emphasis degree information. The modification method decision module decides a modification method, by smoothing processing, for the pitch pattern in a connection portion between a prosody control unit and at least one of the previous and next prosody control units, based on at least the emphasis degree information, to generate modification method information. The pattern connection module modifies the pitch patterns generated in the respective prosody control units by smoothing processing according to the modification method information, and connects them to generate a sentence pitch pattern corresponding to a text to be a target for speech synthesis.02-26-2009
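Smoothing a connection portion between two pitch patterns can be sketched as below. The per-frame Hz representation and the mapping from emphasis degree to smoothing-window width are illustrative assumptions, not details from the patent:

```python
def connect_pitch_patterns(a, b, emphasis_degree=0, base_window=3):
    """Concatenate two per-frame pitch patterns (Hz) and smooth the
    connection portion with a moving average. A higher emphasis degree
    narrows the smoothing window so an emphasized contour is altered
    less (the degree-to-window mapping is an illustrative assumption)."""
    w = max(1, base_window - emphasis_degree)
    joined = list(a) + list(b)
    src = joined[:]                      # average over the unmodified copy
    junction = len(a)
    for i in range(max(0, junction - w), min(len(joined), junction + w)):
        lo, hi = max(0, i - w), min(len(src), i + w + 1)
        joined[i] = sum(src[lo:hi]) / (hi - lo)
    return joined
```

Joining a flat 100 Hz pattern to a flat 200 Hz pattern, for example, yields a contour that steps gradually across the junction instead of jumping, while frames far from the connection portion are left unchanged.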
20090099846METHOD AND APPARATUS FOR PREPARING A DOCUMENT TO BE READ BY TEXT-TO-SPEECH READER - There is disclosed a method and system for preparing a document to be read by a text-to-speech reader. The method can include identifying two or more voice types available to the text-to-speech reader, identifying the text elements within the document, grouping related text elements together, and classifying the text elements according to voice types available to the text-to-speech reader. The method of grouping the related text elements together can include syntactic and intelligent clustering. The classification of text elements can include performing latent semantic analysis on the text elements and characteristics of the available voice types.04-16-2009
20090083036Unnatural prosody detection in speech synthesis - Described is a technology by which synthesized speech generated from text is evaluated against a prosody model (trained offline) to determine whether the speech will sound unnatural. If so, the speech is regenerated with modified data. The evaluation and regeneration may be iterative until deemed natural sounding. For example, text is built into a lattice that is then (e.g., Viterbi) searched to find a best path. The sections (e.g., units) of data on the path are evaluated via a prosody model. If the evaluation deems a section to correspond to unnatural prosody, that section is replaced, e.g., by modifying/pruning the lattice and re-performing the search. Replacement may be iterative until all sections pass the evaluation. Unnatural prosody detection may be biased such that during evaluation, unnatural prosody is falsely detected at a higher rate relative to a rate at which unnatural prosody is missed.03-26-2009
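The evaluate/prune/re-search loop described above can be sketched with a toy lattice. The unit tuples, the cost-only "search" (a stand-in for the Viterbi search over the full lattice), and the pitch-range check standing in for the trained prosody model are all illustrative assumptions:

```python
def best_path(lattice, pruned):
    """Cheapest surviving candidate in each slot (stand-in for a
    Viterbi search over the full lattice)."""
    return [min((u for u in slot if u[0] not in pruned), key=lambda u: u[1])
            for slot in lattice]

def synthesize(lattice, is_natural, max_iters=5):
    """Search, evaluate each unit on the best path against a prosody
    model, prune units deemed unnatural, and re-search until every
    section passes (a sketch of the iterative loop)."""
    pruned = set()
    path = best_path(lattice, pruned)
    for _ in range(max_iters):
        bad = [u for u in path if not is_natural(u)]
        if not bad:
            break
        pruned.update(u[0] for u in bad)
        path = best_path(lattice, pruned)
    return path

# Toy lattice: each slot holds (unit id, cost, pitch in Hz).
lattice = [
    [("a1", 1.0, 330.0), ("a2", 2.0, 190.0)],
    [("b1", 1.0, 180.0)],
]
is_natural = lambda u: 80.0 <= u[2] <= 300.0   # stand-in prosody model
path = synthesize(lattice, is_natural)
print([u[0] for u in path])  # the unnatural 'a1' is replaced by 'a2'
```

The cheapest path initially picks `a1`, but its out-of-range pitch fails the naturalness check, so it is pruned from the lattice and the re-run search falls back to `a2`.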
20090083037INTERACTIVE DEBUGGING AND TUNING OF METHODS FOR CTTS VOICE BUILDING - A method, a system, and an apparatus for identifying and correcting sources of problems in synthesized speech which is generated using a concatenative text-to-speech (CTTS) technique. The method can include the step of displaying a waveform corresponding to synthesized speech generated from concatenated phonetic units. The synthesized speech can be generated from text input received from a user. The method further can include the step of displaying parameters corresponding to at least one of the phonetic units. The method can include the step of displaying the original recordings containing selected phonetic units. An editing input can be received from the user and the parameters can be adjusted in accordance with the editing input.03-26-2009
20090083035TEXT PRE-PROCESSING FOR TEXT-TO-SPEECH GENERATION - A system and method are provided for improved speech synthesis, wherein text data is pre-processed according to updated grammar rules or a selected group of grammar rules. In one embodiment, the TTS system comprises a first memory adapted to store a text information database, a second memory adapted to store grammar rules, and a receiver adapted to receive update data regarding the grammar rules. The system also includes a TTS engine adapted to retrieve at least one text entry from the text information database, pre-process the at least one text entry by applying the updated grammar rules to the at least one text entry, and generate speech based at least in part on the at least one pre-processed text entry.03-26-2009
20110231193SYNTHESIZED SINGING VOICE WAVEFORM GENERATOR - Various technologies for generating a synthesized singing voice waveform. In one implementation, the computer program may receive a request from a user to create a synthesized singing voice using the lyrics of a song and a digital file containing its melody as inputs. The computer program may then dissect the lyrics' text and its melody file into corresponding sub-phonemic units and a musical score, respectively. The musical score may be further dissected into a sequence of musical notes and duration times for each musical note. The computer program may then determine a fundamental frequency (F0).09-22-2011
20120078633READING ALOUD SUPPORT APPARATUS, METHOD, AND PROGRAM - According to one embodiment, a reading aloud support apparatus includes a reception unit, a first extraction unit, a second extraction unit, an acquisition unit, a generation unit, and a presentation unit. The reception unit is configured to receive an instruction. The first extraction unit is configured to extract, as a partial document, a part of the document which corresponds to a range of words. The second extraction unit is configured to perform morphological analysis and to extract words as candidate words. The acquisition unit is configured to acquire attribute information items related to the candidate words. The generation unit is configured to perform weighting based on a value corresponding to a distance and to determine which candidate words to preferentially present, generating a presentation order. The presentation unit is configured to present the candidate words and the attribute information items in accordance with the presentation order.03-29-2012
20090204402METHOD AND APPARATUS FOR CREATING CUSTOMIZED PODCASTS WITH MULTIPLE TEXT-TO-SPEECH VOICES - Method and apparatus for creating customized podcasts with multiple voices, where text content is converted into audio content, and where the voices are selected at least in part on words in the text content suggestive of the type of voice. Types of voice include at least male and female, accent, language, and speed.08-13-2009
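Selecting a voice "at least in part on words in the text content suggestive of the type of voice", as the abstract above puts it, can be sketched with a simple keyword cue table. The cue lists, the default voice name, and the punctuation handling are illustrative assumptions:

```python
def pick_voice(text, available_voices, default="neutral"):
    """Choose a voice type from keyword cues found in the text
    (e.g. 'she said' suggests a female voice). The cue lists are
    illustrative assumptions, not taken from the patent."""
    cues = {
        "female": {"she", "her", "mrs", "ms"},
        "male": {"he", "his", "mr"},
    }
    words = {w.strip(".,!?\"'").lower() for w in text.split()}
    for voice, keywords in cues.items():
        if voice in available_voices and words & keywords:
            return voice
    return default
```

A fuller implementation would extend the same lookup with cues for accent, language, and speed, which the abstract lists as further voice types.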
20090204401SPEECH PROCESSING SYSTEM, SPEECH PROCESSING METHOD, AND SPEECH PROCESSING PROGRAM - Provided is a speech translation system for receiving an input of the original speech in a first language, translating an input content into a second language, and outputting a result of the translating as a speech, including: an input processing part for receiving the input of the original speech, and generating, from the original speech, an original language text and the prosodic information of the original speech; a translation part for generating a translated sentence by translating the first language into the second language; prosodic feature transform information including associated prosodic information between the first language and the second language; a prosodic feature transform part for transforming the prosodic information of the original speech into prosodic information of the speech to be output; and a speech synthesis part for outputting the translated sentence as a speech synthesized based on the prosodic information of the speech to be output.08-13-2009
20080262846Wireless server based text to speech email - An email system for mobile devices, such as cellular phones and PDAs, is disclosed which allows email messages to be played back on the mobile device as voice messages on demand by way of a media player, thus eliminating the need for a unified messaging system. Email messages are received by the mobile device in a known manner. In accordance with an important aspect of the invention, the email messages are identified by the mobile device as they are received. After the message is identified, the mobile device sends the email message in text format to a server for conversion to speech or voice format. After the message is converted to speech format, the server sends the messages back to the user's mobile device and notifies the user of the email message and then plays the message back to the user through a media player upon demand.10-23-2008
20080262845METHOD TO TRANSLATE, CACHE AND TRANSMIT TEXT-BASED INFORMATION CONTAINED IN AN AUDIO SIGNAL - A method, system and computer-readable medium for generating, caching and transmitting textual equivalents of information contained in an audio signal are presented. The method includes generating, in a first device, a textual equivalent of at least a portion of a speech-based audio signal, storing a portion of the textual equivalent in the first device's memory, and transmitting the stored textual equivalent to another device.10-23-2008
20110231192System and Method for Audio Content Generation - A system and method for generating audio content. Content is automatically retrieved from an original website according to a predetermined schedule to generate retrieved content. The retrieved content is converted to one or more audio files. A hierarchy is assigned to the one or more audio files to provide an audible website that mimics the hierarchy of the retrieved content as represented at the original website. The audible website is stored in a database for retrieval by one or more users. A first user input is received indicating an attempt to access the original website. The audible website is indicated as being associated with the original website in response to the user selection. Portions of the audible website are played in response to a second user input.09-22-2011
20090198497METHOD AND APPARATUS FOR SPEECH SYNTHESIS OF TEXT MESSAGE - Provided is a method and apparatus for speech synthesis of a text message. The method includes receiving input of voice parameters for a text message, storing each of the text message and the input voice parameters in a data packet, and transmitting the data packet to a receiving terminal.08-06-2009
20090204404METHOD AND APPARATUS FOR CONTROLLING PLAY OF AN AUDIO SIGNAL - Apparatus and methods conforming to the present invention comprise a method of controlling playback of an audio signal through analysis of a corresponding closed caption signal in conjunction with analysis of the corresponding audio signal. Objectionable text or other specified text in the closed caption signal is identified through comparison with user-identified objectionable text. Upon identification of the objectionable text, the audio signal is analyzed to identify the audio portion corresponding to the objectionable text. Upon identification of the audio portion, the audio signal may be controlled to mute the audible objectionable text.08-13-2009
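Once each caption word has been located in the audio signal, deriving the spans to mute is straightforward. The sketch below assumes the word-to-audio alignment is already given as timestamped triples; producing that alignment is the harder analysis step the abstract refers to:

```python
def mute_spans(captions, blocklist, pad=0.2):
    """captions: (start_s, end_s, word) triples aligned to the audio.
    Return padded (start, end) spans during which the audio signal
    should be muted. The padding value is an illustrative assumption."""
    return [(max(0.0, start - pad), end + pad)
            for start, end, word in captions if word.lower() in blocklist]
```

The small padding on each side guards against alignment error, at the cost of clipping a fraction of the neighboring words.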
20090204403SPEECH GENERATING MEANS FOR USE WITH SIGNAL SENSORS - An apparatus includes receiving circuitry for receiving a signal; and a speech module for converting the signal into speech.08-13-2009
20090254347PROACTIVE COMPLETION OF INPUT FIELDS FOR AUTOMATED VOICE ENABLEMENT OF A WEB PAGE - Embodiments of the present invention provide a method and computer program product for the proactive completion of input fields for automated voice enablement of a Web page. In an embodiment of the invention, a method for proactively completing empty input fields for voice enabling a Web page can be provided. The method can include receiving speech input for an input field in a Web page and inserting a textual equivalent to the speech input into the input field in the Web page. The method further can include locating an empty input field remaining in the Web page, generating a speech grammar for the input field based upon permitted terms in a core attribute of the empty input field, and prompting for speech input for the input field. Finally, the method can include posting the received speech input and the grammar to an automatic speech recognition (ASR) engine and inserting a textual equivalent to the speech input provided by the ASR engine into the empty input field.10-08-2009
20090248417SPEECH PROCESSING APPARATUS, METHOD, AND COMPUTER PROGRAM PRODUCT - A method to generate a pitch contour for speech synthesis is proposed. The method is based on finding the pitch contour that maximizes a total likelihood function created by the combination of all the statistical models of the pitch contour segments of an utterance, at one or multiple linguistic levels. These statistical models are trained from a database of spoken speech, by means of a decision tree that, for each linguistic level, clusters the parametric representations of the pitch segments extracted from the spoken speech data with features obtained from the text associated with that speech data. The parameterization of the pitch segments is performed in such a way that the likelihood function of any linguistic level can be expressed in terms of the parameters of one of the levels, thus allowing the maximization to be calculated with respect to the parameters of that level. Moreover, the parameterization of that main level has to be invertible so that the final pitch contour is obtained from the parameters of that level by means of an inverse transformation.10-01-2009
20090259473METHODS AND APPARATUS TO PRESENT A VIDEO PROGRAM TO A VISUALLY IMPAIRED PERSON - Methods and apparatus to present a video program to a visually impaired person are disclosed. An example method comprises receiving a video stream and an associated audio stream of a video program, detecting a portion of the video program that is not readily consumable by a visually impaired person, obtaining text associated with the portion of the video program, converting the text to a second audio stream, and combining the second audio stream with the associated audio stream.10-15-2009
20090254349SPEECH SYNTHESIZER - A speech synthesizer can execute speech content editing at high speed and generate speech content easily. The speech synthesizer includes a small speech element DB.10-08-2009
20090254348FREE FORM INPUT FIELD SUPPORT FOR AUTOMATED VOICE ENABLEMENT OF A WEB PAGE - Embodiments of the present invention provide a method and computer program product for the automated voice enablement of a Web page with free form input field support. In an embodiment of the invention, a method for voice enabling a Web page with free form input field support can be provided. The method can include receiving speech input for an input field in a Web page, parsing a core attribute for the input field and identifying an external statistical language model (SLM) referenced by the core attribute of the input field, posting the received speech input and the SLM to an automatic speech recognition (ASR) engine, and inserting a textual equivalent to the speech input provided by the ASR engine in conjunction with the SLM into the input field.10-08-2009
20090254345Intelligent Text-to-Speech Conversion - Techniques for improved text-to-speech processing are disclosed. The improved text-to-speech processing can convert text from an electronic document into an audio output that includes speech associated with the text as well as audio contextual cues. One aspect provides audio contextual cues to the listener when outputting speech (spoken text) pertaining to a document. The audio contextual cues can be based on an analysis of a document prior to a text-to-speech conversion. Another aspect can produce an audio summary for a file. The audio summary for a document can thereafter be presented to a user so that the user can hear a summary of the document without having to process the document to produce its spoken text via text-to-speech conversion.10-08-2009
20090259472SYSTEM AND METHOD FOR ANSWERING A COMMUNICATION NOTIFICATION - Disclosed herein are systems, methods, and computer readable-media for answering a communication notification. The method for answering a communication notification comprises receiving a notification of communication from a user, converting information related to the notification to speech, outputting the information as speech to the user, and receiving from the user an instruction to accept or ignore the incoming communication associated with the notification. In one embodiment, information related to the notification comprises one or more of a telephone number, an area code, a geographic origin of the request, caller id, a voice message, address book information, a text message, an email, a subject line, an importance level, a photograph, a video clip, metadata, an IP address, or a domain name. Another embodiment involves a notification assigned an importance level, with repeated attempts at notification if it is of high importance.10-15-2009
20090259471DISTANCE METRICS FOR UNIVERSAL PATTERN PROCESSING TASKS - A universal pattern processing system receives input data and produces the output patterns that are best associated with said data. The system uses input means for receiving and processing input data, universal pattern decoder means for transforming models using the input data and associating output patterns with the original models that are changed least during transformation, and output means for outputting the best associated patterns chosen by the pattern decoder means.10-15-2009
20100191533CHARACTER INFORMATION PRESENTATION DEVICE - The text information presentation device calculates an optimum readout speed on the basis of the content of the text information being input, its arrival time, and its previous arrival time; speech-synthesizes the text information being input at the calculated readout speed and outputs it as an audio signal; or, alternatively, controls the speed at which a video signal is output according to an output state of the speech synthesizing unit.07-29-2010
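Deriving a readout speed from the interval between successive arrivals can be sketched as below. Measuring text length in characters and expressing the rate in characters per second are illustrative assumptions; a real device would work in syllables or morae:

```python
def readout_rate(text, arrived_at, prev_arrived_at,
                 min_rate=3.0, max_rate=12.0):
    """Choose a speech rate (characters per second, an illustrative
    unit) so the text can be spoken before the next item is expected,
    based on the interval since the previous arrival."""
    interval = max(arrived_at - prev_arrived_at, 0.01)  # avoid divide-by-zero
    needed = len(text) / interval
    return min(max(needed, min_rate), max_rate)
```

Text arriving in rapid succession pushes the rate toward `max_rate`, while sparse arrivals fall back to the comfortable `min_rate` floor.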
20080208584Pausing A VoiceXML Dialog Of A Multimodal Application - Pausing a VoiceXML dialog of a multimodal application, including generating by the multimodal application a pause event; responsive to the pause event, temporarily pausing the dialog by the VoiceXML interpreter; generating by the multimodal application a resume event; and responsive to the resume event, resuming the dialog. Embodiments are implemented with the multimodal application operating on a multimodal device supporting multiple modes of interaction including a voice mode and one or more non-voice modes, the multimodal application is operatively coupled to a VoiceXML interpreter, and the VoiceXML interpreter is interpreting the VoiceXML dialog to be paused.08-28-2008
20100228549SYSTEMS AND METHODS FOR DETERMINING THE LANGUAGE TO USE FOR SPEECH GENERATED BY A TEXT TO SPEECH ENGINE - Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets, where each text string can be associated with a native string language (e.g., the language of the string). When several text strings are associated with at least two distinct languages, a series of rules can be applied to the strings to identify a single voice language to use for synthesizing the speech content from the text strings. In some embodiments, a prioritization scheme can be applied to the text strings to identify the more important text strings. The rules can include, for example, selecting a voice language based on the prioritization scheme, a default language associated with an electronic device, the ability of a voice language to speak text in a different language, or any other suitable rule.09-09-2010
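The rule cascade this abstract describes can be sketched as below. The field priorities, the shape of the `speakable` capability map, and the rule ordering are all illustrative assumptions rather than the patent's actual rules:

```python
# Hypothetical sketch of choosing one voice language when text strings
# (e.g. title/artist/album of a media asset) disagree on language.

FIELD_PRIORITY = ["title", "artist", "album"]  # most important field first

def pick_voice_language(strings, device_default, speakable):
    """strings: field -> (text, language); speakable: lang -> set of languages
    a voice in that language can also pronounce acceptably."""
    langs = {lang for _, lang in strings.values()}
    if len(langs) == 1:                      # no conflict: use the one language
        return langs.pop()
    # Rule 1: prefer the language of the highest-priority field present.
    for field in FIELD_PRIORITY:
        if field in strings:
            cand = strings[field][1]
            # Rule 2: keep it only if that voice can speak the other languages.
            if langs <= speakable.get(cand, {cand}):
                return cand
    # Rule 3: fall back to the device's default language.
    return device_default
```

A usage example: for a title in English and an artist name in French, an English voice that can also pronounce French would be selected; if no candidate voice covers all the languages, the device default wins.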
20100145705AUDIO WITH SOUND EFFECT GENERATION FOR TEXT-ONLY APPLICATIONS - A method of generating audio for a text-only application comprises the steps of adding a tag to an input text, the tag being usable for adding a sound effect to the generated audio; processing the tag to form instructions for generating the audio; and generating audio with the sound effect based on the instructions while the text is being presented. The present invention adds entertainment value to text applications, provides a very compact format compared to conventional multimedia, and uses entertaining sound to make text-only applications such as SMS and email more fun and entertaining.06-10-2010
20100217600ELECTRONIC DEVICE AND METHOD OF ASSOCIATING A VOICE FONT WITH A CONTACT FOR TEXT-TO-SPEECH CONVERSION AT THE ELECTRONIC DEVICE - A method of associating a voice font with a contact for text-to-speech conversion at an electronic device includes obtaining, at the electronic device, the voice font for the contact, and storing the voice font in association with a contact data record stored in a contacts database at the electronic device. The contact data record includes contact data for the contact.08-26-2010
20090076820METHOD AND APPARATUS FOR TAGTOE REMINDERS - A network-based text-to-speech (TTS) TagToe alert system is configured to take a user's textual and/or multimedia input to a TagToe user interface to schedule delivery of text-to-speech-converted TagToe information to one or more telephone call recipients. The text-to-speech converted TagToe information optionally includes e-commerce specific, location-specific, and/or product-specific information which may be presented to the one or more call recipients as additional voice information or interactive voice response (IVR) information. The TagToe alert system can be configured to provide an advanced level of integration between IP telephony and electronic transactions and online services for optimized efficiency and improved revenue to e-commerce.03-19-2009
20090076819Text to speech synthesis - An input linguistic description is converted into a speech waveform by deriving at least one target unit sequence corresponding to the linguistic description, selecting from a waveform unit database for the target unit sequences a plurality of alternative unit sequences approximating the target unit sequences, concatenating the alternative unit sequences to alternative speech waveforms and presenting the alternative speech waveforms to an operating person and enabling the choice of one of the presented alternative speech waveforms. There are no iterative cycles of manual modification and automatic selection, which enables a fast way of working. The operator does not need knowledge of units, targets, and costs, but chooses from a set of given alternatives. The fine-tuning of TTS prompts therefore becomes accessible to non-experts.03-19-2009
20100211393SPEECH SYNTHESIS DEVICE, SPEECH SYNTHESIS METHOD, AND SPEECH SYNTHESIS PROGRAM - A speech synthesis device is provided with: a central segment selection unit for selecting a central segment from among a plurality of speech segments; a prosody generation unit for generating prosody information based on the central segment; a non-central segment selection unit for selecting a non-central segment, which is a segment outside of a central segment section, based on the central segment and the prosody information; and a waveform generation unit for generating a synthesized speech waveform based on the prosody information, the central segment, and the non-central segment. The speech synthesis device first selects a central segment that forms a basis for prosody generation and generates prosody information based on the central segment so that it is possible to sufficiently reduce both concatenation distortion and sound quality degradation accompanying prosody control in the section of the central segment.08-19-2010
20100324902Systems and Methods Document Narration - Disclosed are techniques and systems to provide a narration of a text in multiple different voices. In some aspects, systems and methods described herein can include receiving a user-based selection of a first portion of words in a document where the document has a pre-associated first voice model and overwriting the association of the first voice model, by the one or more computers, with a second voice model for the first portion of words.12-23-2010
20100211392SPEECH SYNTHESIZING DEVICE, METHOD AND COMPUTER PROGRAM PRODUCT - The speech synthesizing device acquires numerical data at regular time intervals, each piece of the numerical data representing a value having a plurality of digits, detects a change between two values represented by the numerical data that is acquired at two consecutive times, determines which digit of the value represented by the numerical data is used to generate speech data depending on the detected change, generates numerical information that indicates the determined digit of the value represented by the numerical data, and generates speech data from the digit indicated by the numerical information.08-19-2010
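The digit-selection step this abstract describes can be illustrated with a small sketch: compare two consecutive multi-digit readings and speak only from the most significant digit that changed. The fixed-width zero padding is an assumption for illustration:

```python
# Illustrative sketch (not the patented method): given the previous and current
# numeric readings, return only the digits that need to be spoken.

def digits_to_speak(prev_value, value, width=4):
    """Return the substring of the new value starting at the most significant
    changed digit, or '' if nothing changed."""
    prev_s, cur_s = f"{prev_value:0{width}d}", f"{value:0{width}d}"
    for i, (a, b) in enumerate(zip(prev_s, cur_s)):
        if a != b:
            return cur_s[i:]    # speak from the first differing digit onward
    return ""                   # value unchanged: no speech needed
```

For example, when a reading moves from 1234 to 1239 only "9" would be spoken, keeping the speech output short for rapidly updating values.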
20130218568SPEECH SYNTHESIS DEVICE, SPEECH SYNTHESIS METHOD, AND COMPUTER PROGRAM PRODUCT - According to an embodiment, a speech synthesis device includes a first storage, a second storage, a first generator, a second generator, a third generator, and a fourth generator. The first storage is configured to store therein first information obtained from a target uttered voice. The second storage is configured to store therein second information obtained from an arbitrary uttered voice. The first generator is configured to generate third information by converting the second information so as to be close to a target voice quality or prosody. The second generator is configured to generate an information set including the first information and the third information. The third generator is configured to generate fourth information used to generate a synthesized speech, based on the information set. The fourth generator is configured to generate the synthesized speech corresponding to input text using the fourth information.08-22-2013
20100070282METHOD AND APPARATUS FOR IMPROVING TRANSACTION SUCCESS RATES FOR VOICE REMINDER APPLICATIONS IN E-COMMERCE - Methods and apparatuses are disclosed for improving transaction success rates for voice reminder applications in e-commerce. In one embodiment of the invention, the voice reminder applications in e-commerce utilizes a network-based text-to-speech (TTS) alert system, which can generate a purchase reminder associated with a recipient's potential purchase. The network-based text-to-speech (TTS) alert system can also deliver the purchase reminder to a recipient's voicemail and leave a transaction identifier number and a centralized or a recipient-specific call-back phone number to the recipient's voicemail. A recipient can utilize the transaction identifier number, the centralized or the recipient-specific call-back phone number, and optionally a recipient-specific password to make a phone call to retrieve the purchase reminder previously delivered to the recipient's voicemail by the network-based text-to-speech (TTS) alert system. Then, the recipient can authorize and/or complete a transaction related to the purchase reminder over the same phone call.03-18-2010
20120143611Trajectory Tiling Approach for Text-to-Speech - Hidden Markov Model (HMM) trajectory tiling (HTT)-based approaches may be used to synthesize speech from text. In operation, a set of HMMs and a set of waveform units may be obtained from a speech corpus. The set of HMMs is further refined via minimum generation error (MGE) training to generate a refined set of HMMs. Subsequently, a speech parameter trajectory may be generated by applying the refined set of HMMs to an input text. A unit lattice of candidate waveform units may be selected from the set of waveform units based at least on the speech parameter trajectory. A normalized cross-correlation (NCC)-based search on the unit lattice may be performed to obtain a minimal concatenation cost sequence of candidate waveform units, which are concatenated into a waveform sequence that is synthesized into speech.06-07-2012
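Of the steps above, the NCC scoring is the easiest to illustrate. A minimal pure-Python sketch, using toy integer signals and a hypothetical fixed comparison window (neither the window size nor the search range comes from the patent):

```python
# Illustrative sketch: score how well a candidate unit's head lines up with the
# previous unit's tail using normalized cross-correlation, and pick the shift
# with the best score.

import math

def ncc(x, y):
    """Normalized cross-correlation of two equal-length frames, in [-1, 1]."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den if den else 0.0

def best_join_offset(tail, head, window=3):
    """Shift of `head` maximizing NCC against the last `window` tail samples."""
    ref = tail[-window:]
    best_s, best = 0, -2.0
    for s in range(len(head) - window + 1):
        score = ncc(ref, head[s:s + window])
        if score > best:
            best, best_s = score, s
    return best_s
```

In a real system the frames would be waveform samples and the best-scoring joins would feed the minimal-concatenation-cost search over the unit lattice.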
20090326949SYSTEM AND METHOD FOR EXTRACTION OF META DATA FROM A DIGITAL MEDIA STORAGE DEVICE FOR MEDIA SELECTION IN A VEHICLE - A method is provided for extracting meta data from a digital media storage device in a vehicle over a communication link between a control module of the vehicle and the digital media storage device. The method includes establishing a communication link between control module of the vehicle and the digital media storage device, identifying a media file on the digital media storage device, and retrieving meta data from a media file, the meta data including a plurality of entries, wherein at least one of the plurality of entries includes text data. The method further includes identifying the text data in an entry of the media file and storing the plurality of entries in a memory.12-31-2009
20090018836SPEECH SYNTHESIS SYSTEM AND SPEECH SYNTHESIS METHOD - In speech synthesis, a selecting unit selects one string from first speech unit strings corresponding to a first segment sequence obtained by dividing a phoneme string corresponding to target speech into segments. The selecting unit repeatedly performs generating, based on at most W second speech unit strings corresponding to a second segment sequence that is a partial sequence of the first sequence, third speech unit strings corresponding to a third segment sequence obtained by adding a segment to the second sequence, and selecting at most W strings from the third strings based on an evaluation value of each of the third strings. The value is obtained by correcting the total cost of each third string candidate with a penalty coefficient for each of the third strings. The coefficient is based on a restriction concerning the quickness of speech unit data acquisition, and depends on the extent to which the restriction is approached.01-15-2009
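The repeated extend-and-prune selection described above reads like a beam search of width W. A toy sketch under that interpretation, with hypothetical cost and penalty functions standing in for the abstract's total cost and penalty coefficient:

```python
# Beam-search sketch: extend partial unit strings one segment at a time, score
# each extension with a join cost plus a per-unit penalty, and keep only the W
# best partial strings. Costs here are toy values, not the patent's.

def beam_select(candidates_per_segment, cost, penalty, W=2):
    """candidates_per_segment: list (per segment) of candidate unit ids.
    cost(prev_unit_or_None, unit): join cost; penalty(unit): acquisition penalty."""
    beams = [((), 0.0)]                          # (partial string, total cost)
    for candidates in candidates_per_segment:
        extended = [(seq + (u,),
                     c + cost(seq[-1] if seq else None, u) + penalty(u))
                    for seq, c in beams for u in candidates]
        beams = sorted(extended, key=lambda x: x[1])[:W]   # prune to W best
    return beams[0][0]                           # lowest-cost complete string
```

Keeping W small bounds both the search time and, in the spirit of the abstract, how many speech unit candidates must be fetched per step.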
20090150157SPEECH PROCESSING APPARATUS AND PROGRAM - A word dictionary including sets of a character string which constitutes a word, a phoneme sequence which constitutes the pronunciation of the word, and a part of speech of the word is referenced; an entered text is analyzed and divided into one or more subtexts; a phoneme sequence and a part-of-speech sequence are generated for each subtext; the part-of-speech sequence of the subtext is collated with a list of part-of-speech sequences to determine whether the phonetic sounds of the subtext are to be converted; and the phonetic sounds of the phoneme sequence in each subtext whose phonetic sounds are determined to be converted are converted.06-11-2009
20090112597PREDICTING A RESULTANT ATTRIBUTE OF A TEXT FILE BEFORE IT HAS BEEN CONVERTED INTO AN AUDIO FILE - An apparatus for predicting a resultant attribute of a text file before it has been converted to an audio file by a text-to-speech converter application. In accordance with an embodiment, the apparatus includes: a receiver component for receiving a text file and a request to determine a resultant attribute of the text file before it is converted to an audio file, by a text-to-speech converter component; a calculation component for determining a file type associated with the received text file and the size of the received text file; a calculation component for identifying an attribute associated with the determined file type; and a calculation component for determining from the identified attribute and the size of the received text file a resultant attribute of the text file before it is converted to an audio file by the text-to-speech converter component.04-30-2009
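The prediction described above can be sketched very simply: look up an attribute for the file type, combine it with the file size, and report the resultant attribute (here, duration) without doing any conversion. The file types, payload fractions, and speaking rate are hypothetical values for illustration:

```python
# Illustrative sketch (not the patented method): predict the duration of the
# audio a text file would produce, from its size and a per-file-type attribute.

FILE_TYPE_ATTRS = {            # hypothetical: (spoken payload fraction, chars/sec)
    "txt":  (1.00, 15.0),
    "html": (0.60, 15.0),      # assume markup is stripped before synthesis
}

def predict_duration_seconds(file_type, size_bytes):
    payload_fraction, chars_per_sec = FILE_TYPE_ATTRS[file_type]
    spoken_chars = size_bytes * payload_fraction   # bytes ~ chars for ASCII text
    return spoken_chars / chars_per_sec
```

The point of such a predictor is that it answers "how long will this be as audio?" from metadata alone, which is far cheaper than running the text-to-speech conversion.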
20090112596SYSTEM AND METHOD FOR IMPROVING SYNTHESIZED SPEECH INTERACTIONS OF A SPOKEN DIALOG SYSTEM - A system and method are disclosed for synthesizing speech based on a selected speech act. A method includes modifying synthesized speech of a spoken dialog system by (1) receiving a user utterance, (2) analyzing the user utterance to determine an appropriate speech act, and (3) generating a response of a type associated with the appropriate speech act, wherein linguistic variables in the response are selected based on the appropriate speech act.04-30-2009
20100312564LOCAL AND REMOTE FEEDBACK LOOP FOR SPEECH SYNTHESIS - A local text to speech feedback loop is utilized to modify algorithms used in speech synthesis to provide a user with an improved experience. A remote text to speech feedback loop is utilized to aggregate local feedback loop data and incorporate best solutions into new improved text to speech engine for deployment.12-09-2010
20100318364SYSTEMS AND METHODS FOR SELECTION AND USE OF MULTIPLE CHARACTERS FOR DOCUMENT NARRATION - Disclosed are techniques and systems to provide a narration of a text in multiple different voices. Further disclosed are techniques and systems for generating an audible output in which different portions of a text are narrated using voice models associated with different characters.12-16-2010
20100312562HIDDEN MARKOV MODEL BASED TEXT TO SPEECH SYSTEMS EMPLOYING ROPE-JUMPING ALGORITHM - A rope-jumping algorithm is employed in a Hidden Markov Model based text to speech system to determine start and end models and to modify the start and end models by setting small co-variances. Disordered acoustic parameters due to violation of parameter constraints are avoided through the modification and result in stable line frequency spectrum for the generated speech.12-09-2010
20090070114AUDIBLE METADATA - This disclosure describes systems and methods for audibly presenting metadata. Audibly presentable metadata is referred to as audible metadata. Audible metadata may be associated with one or more media objects. In one embodiment, audible metadata is pre-recorded requiring little or no processing before it can be rendered. In another embodiment, audible metadata is text, and a text-to-speech conversion device may be used to convert the text into renderable audible metadata. Audible metadata may be rendered at any point before or after rendering of a media object, or may be rendered during rendering of a media object via a dynamic user request.03-12-2009
20090070115SPEECH SYNTHESIS SYSTEM, SPEECH SYNTHESIS PROGRAM PRODUCT, AND SPEECH SYNTHESIS METHOD - It is an objective of the present invention to provide waveform concatenation speech synthesis with high sound quality, exploiting its advantages where a large quantity of speech segments is available, while providing waveform concatenation speech synthesis with accurate accents in other cases. Prosody with both high accuracy and high sound quality is achieved by performing a two-path search comprising a speech segment search and a prosody modification value search. In the preferred embodiment of the present invention, an accurate accent is secured by evaluating the consistency of the prosody using a statistical model of prosody variations (the slope of the fundamental frequency) for both of the two paths, the speech segment selection and the modification value search. In the prosody modification value search, a prosody modification value sequence that minimizes a modified prosody cost is searched for. This allows a search for a modification value sequence that makes the likelihood of the absolute values or variations of the prosody under the statistical model as high as possible with minimal modification values.03-12-2009
20090070116FUNDAMENTAL FREQUENCY PATTERN GENERATION APPARATUS AND FUNDAMENTAL FREQUENCY PATTERN GENERATION METHOD - A fundamental frequency pattern generation apparatus includes a first storage including representative vectors each corresponding to a prosodic control unit and having a section for changing the number of phonemes, a second storage unit including a rule to select a vector corresponding to an input context, a selection unit configured to select a vector from the representative vectors by applying the rule to the context and output the selected vector, a calculation unit configured to calculate an expansion/contraction ratio of the section of the selected vector in a time-axis direction based on a designated value for a specific feature amount related to a length of a fundamental frequency pattern to be generated, the designated value of the feature amount being required of the fundamental frequency pattern to be generated, and an expansion/contraction unit configured to expand/contract the selected vector based on the expansion/contraction ratio to generate the fundamental frequency pattern.03-12-2009
20100312563TECHNIQUES TO CREATE A CUSTOM VOICE FONT - Techniques to create and share custom voice fonts are described. An apparatus may include a preprocessing component to receive voice audio data and a corresponding text script from a client and to process the voice audio data to produce prosody labels and a rich script. The apparatus may further include a verification component to automatically verify the voice audio data and the text script. The apparatus may further include a training component to train a custom voice font from the verified voice audio data and rich script and to generate custom voice font data usable by the TTS component. Other embodiments are described and claimed.12-09-2010
20110010178SYSTEM AND METHOD FOR TRANSFORMING VERNACULAR PRONUNCIATION - Provided is a system and method for transforming vernacular pronunciation with respect to Hanja using a statistical method. In a system for transforming vernacular pronunciation, a vernacular pronunciation extracting unit extracts a vernacular pronunciation with respect to a Hanja character string, a statistical data determining unit determines statistical data with respect to the Hanja character string by using statistical data of features related to a Hanja-vernacular pronunciation transformation, and a vernacular pronunciation transforming unit transforms the Hanja character string into a vernacular pronunciation using the extracted vernacular pronunciation and the determined statistical data.01-13-2011
20110035223AUDIO CLIPS FOR ANNOUNCING REMOTELY ACCESSED MEDIA ITEMS - Systems and methods for retrieving and playing back audio clips for streamed or remotely received media items are provided. An electronic device can provide audio clips identifying media items at any suitable time, including for example to identify media items that are currently played back or available for playback. When the media items played back are not locally stored, the electronic device may not have a corresponding audio clip locally stored. In such cases, the electronic device can identify a streamed media item, and retrieve an audio clip corresponding to text items associated with the media item. For example, the electronic device can retrieve audio clips corresponding to the artist, title and album of the received media item. The electronic device can retrieve audio clips from any suitable source, such as a dedicated audio clip server or other remote source, a remote text-to-speech engine, or a locally stored text-to-speech engine.02-10-2011
20090063152AUDIO REPRODUCING METHOD, CHARACTER CODE USING DEVICE, DISTRIBUTION SERVICE SYSTEM, AND CHARACTER CODE MANAGEMENT METHOD - A character code is associated with sound as well as character or sign so as to enhance expressiveness on the Internet or in electronic mail. Sound data is recorded in the character code using device in association with the character code. The user can reproduce an intended sound in the same way as he or she displays a character on the character code using device, whereby the user can enhance his or her expressiveness on the Internet or in electronic mail, for example.03-05-2009
20100145704SYSTEM AND METHOD FOR INCREASING RECOGNITION RATES OF IN-VOCABULARY WORDS BY IMPROVING PRONUNCIATION MODELING - Disclosed herein are systems, methods, and computer readable-media for generating a lexicon for use with speech recognition. The method includes receiving symbolic input as labeled speech data, overgenerating potential pronunciations based on the symbolic input, identifying best potential pronunciations in a speech recognition context, and storing the identified best potential pronunciations in a lexicon. Overgenerating potential pronunciations can include establishing a set of conversion rules for short sequences of letters, converting portions of the symbolic input into a number of possible lexical pronunciation variants based on the set of conversion rules, modeling the possible lexical pronunciation variants in one of a weighted network and a list of phoneme lists, and iteratively retraining the set of conversion rules based on improved pronunciations. Symbolic input can include multiple examples of a same spoken word. Speech data can be labeled explicitly or implicitly and can include words as text and recorded audio.06-10-2010
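The overgeneration step described above can be sketched by applying every matching letter-to-phoneme conversion rule for short letter sequences and collecting all resulting variants; a recognizer would then keep the best-scoring pronunciations. The rule table below is a toy example, not the patent's rule set:

```python
# Illustrative sketch of overgenerating pronunciation variants from conversion
# rules for short letter sequences (longest match tried first).

RULES = {                       # letter sequence -> possible phoneme strings
    "ou": ["AW", "UW"],
    "gh": ["F", ""],            # "" models a silent 'gh'
    "r":  ["R"],
}

def overgenerate(word):
    """Return all pronunciations of `word` reachable via RULES."""
    if not word:
        return [""]
    variants = []
    for length in (2, 1):                       # prefer two-letter rules
        if length > len(word):
            continue
        for phon in RULES.get(word[:length], []):
            variants += [(phon + " " + rest).strip()
                         for rest in overgenerate(word[length:])]
    return variants
```

For "rough" this produces four variants ("R AW F", "R AW", "R UW F", "R UW"); deliberately generating too many and letting recognition pick the winners is the core of the described method.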
20110246201SYSTEM FOR PROVIDING AUDIO MESSAGES ON A MOBILE DEVICE - While performing a function, a mobile device identifies that it is idle while it is downloading content or performing another task. During that idle time, it gathers one or more parameters (e.g., location, time, gender of user, age of user, etc.) and sends a request for an audio message (e.g., audio advertisement). One or more servers at a remote facility receive the request with the one or more parameters, and use the parameters to identify a targeted message. In some cases, the targeted message will include one or more dynamic variables (e.g., distance to store, time to event, etc.) that will be replaced based on the parameters received from the mobile device, so that the audio message is dynamically updated and customized for the mobile device. In one embodiment, the targeted message is transmitted to the mobile device as text. After being received at the mobile device, the text is optionally displayed and converted to an audio format and played for the user.10-06-2011
20110246200PRE-SAVED DATA COMPRESSION FOR TTS CONCATENATION COST - Pre-saved concatenation cost data is compressed through speech segment grouping. Speech segments are assigned to a predefined number of groups based on their concatenation cost values with other speech segments. A representative segment is selected for each group. The concatenation cost between two segments in different groups may then be approximated by that between the representative segments of their respective groups, thereby reducing an amount of concatenation cost data to be pre-saved.10-06-2011
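The compression scheme in this abstract can be sketched directly: pre-save costs only between group representatives, and approximate the cost between any two segments by the cost between their groups' representatives. The grouping and the choice of the first member as representative are illustrative simplifications:

```python
# Illustrative sketch: compress a pre-saved concatenation cost table by
# storing costs only between one representative segment per group.

def build_compressed_table(groups, exact_cost):
    """groups: group_id -> list of segment ids (first member is the
    representative); exact_cost(a, b): true concatenation cost."""
    rep = {seg: members[0] for members in groups.values() for seg in members}
    reps = sorted({members[0] for members in groups.values()})
    table = {(a, b): exact_cost(a, b) for a in reps for b in reps}
    return rep, table

def approx_cost(a, b, rep, table):
    """Approximate cost between any two segments via their representatives."""
    return table[(rep[a], rep[b])]
```

With G groups over N segments, the pre-saved table shrinks from N x N to G x G entries, which is the space saving the abstract is after.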
20100114579System and Method of Controlling Sound in a Multi-Media Communication Application - A computing device and computer-readable medium storing instructions for controlling a computing device to customize a voice in a multi-media message created by a sender for a recipient, the multi-media message comprising a text message from the sender to be delivered by an animated entity. The instructions comprise receiving from the sender voice emoticons, which may be repeated, inserted into the text message and associated with parameters of a voice used by the animated entity to deliver the text message; and transmitting the text message such that a recipient device can deliver the multi-media message at a variable level associated with the number of times a respective voice emoticon is repeated.05-06-2010
20090048842Generalized Object Recognition for Portable Reading Machine - Techniques for operating a reading machine are disclosed. The techniques include forming an N-dimensional features vector based on features of an image, the features corresponding to characteristics of at least one object depicted in the image, representing the features vector as a point in n-dimensional space, where n corresponds to N, the number of features in the features vector and comparing the point in n-dimensional space to a centroid that represents a cluster of points in the n-dimensional space corresponding to a class of objects to determine whether the point belongs in the class of objects corresponding to the centroid.02-19-2009
20090048841Synthesis by Generation and Concatenation of Multi-Form Segments - A speech synthesis system and method is described. A speech segment database references speech segments having various different speech representational structures. A speech segment selector selects from the speech segment database a sequence of speech segment candidates corresponding to a target text. A speech segment sequencer generates from the speech segment candidates sequenced speech segments corresponding to the target text. A speech segment synthesizer combines the selected sequenced speech segments to produce a synthesized speech signal output corresponding to the target text.02-19-2009
20090048840DEVICE FOR CONVERTING INSTANT MESSAGE INTO AUDIO OR VISUAL RESPONSE - The conversion device is connected, by a wired or wireless means, to an input/output port of a computing device installed with an instant messaging software. As instant messages are exchanged, the conversion device is activated by the instant messaging software to produce audio and/or visual responses in accordance with specific texts, symbols, and graphical images contained in the messages received. The conversion device could have an appealing appearance such as a doll, a puppet, or a toy figure. The conversion device can further contain at least an actuation mechanism such that, when activated, the conversion device sends a specific signal to the instant messaging software which encodes and packages the signal into a message and delivers the message to a remote computing device.02-19-2009
20090313023Multilingual text-to-speech system - The invention converts raw data in a base language (e.g. English) into conversational formatted messages in multiple languages. The process converts input data rows into sequences of references to a set of prerecorded audio phrase files. The sequences reference both recorded phrases of input data components and user-created text phrases inserted before and after the input data. When the audio sequences are played in sequence, a coherent conversational message in the language of the caller results. An IVR server responding to a caller's menu selection uses the invention's output data to generate the coherent response. Two embodiments are presented: a simple embodiment that responds to messages, and a more complex embodiment that converts enterprise demographic and member-event data collected over a period into audio sentences played in response to a menu item selection by a caller in the caller's language.12-17-2009
20090037178ANSWER AN INCOMING VOICE CALL WITHOUT REQUIRING A USER TO SPEAK - A system comprises a wireless transceiver and logic coupled to the wireless transceiver. The logic is adapted to answer a phone call from a calling party with an automated voice message and then, in the same phone call, to enable a user to have a two-way conversation with the calling party without requiring the user to speak.02-05-2009
20100063821Hands-Free and Non-Visually Occluding Object Information Interaction System - Technologies are described herein for providing a hands-free and non-visually occluding interaction with object information. In one method, a visual capture of a portion of an object is received through a hands-free and non-visually occluding visual capture device. An audio capture is also received from a user through a hands-free and non-visually occluding audio capture device. The audio capture may include a request for information about a portion of the object in the visual capture. The information is retrieved and is transmitted to the user for playback through a hands-free and non-visually occluding audio output device.03-11-2010
20090216536IMAGE PROCESSING APPARATUS, IMAGE PROCESSING METHOD AND RECORDING MEDIUM - An image processing apparatus comprises an image data input portion that inputs image data and a text data input portion that inputs text data. The text data inputted by the text data input portion is converted into voice data by a voice data converter, and this obtained voice data and the image data inputted by the image data input portion are connected to each other by a connector, and then a file including the voice data and the image data connected to each other is created.08-27-2009
20100057465VARIABLE TEXT-TO-SPEECH FOR AUTOMOTIVE APPLICATION - A text-to-speech (TTS) system implemented in an automotive vehicle is dynamically tuned to improve intelligibility over a wide variety of vehicle operating states and environmental conditions. In one embodiment of the present invention, a TTS system is interfaced to one or more vehicle sensors to measure parameters including vehicle speed, interior noise, visibility conditions, and road roughness, among others. In response to measurements of these operating parameters, TTS voice volume, pitch, and speed, among other parameters, may be tuned in order to improve intelligibility of the TTS voice system and increase its effectiveness for the operator of the vehicle.03-04-2010
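The tuning described above can be sketched as a simple mapping from sensor readings to TTS parameters. The sensor thresholds and gains below are assumptions for illustration, not values from the patent:

```python
# Illustrative sketch: adjust TTS volume and speaking rate from vehicle speed
# and cabin noise so synthesized speech stays intelligible.

def tune_tts(speed_kmh, noise_db, base_volume=0.5, base_rate=1.0):
    """Return (volume in 0..1, rate multiplier) for current conditions."""
    # Louder cabins call for more volume; 40 dB is an assumed quiet baseline.
    volume = base_volume + 0.005 * max(noise_db - 40.0, 0.0)
    # Speak more slowly at highway speeds; 60 km/h is an assumed threshold.
    rate = base_rate - 0.002 * max(speed_kmh - 60.0, 0.0)
    return min(volume, 1.0), max(rate, 0.7)
```

In a real deployment the mapping would be calibrated against measured intelligibility, and other parameters such as pitch could be tuned in the same way.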
20100070281SYSTEM AND METHOD FOR AUDIBLY PRESENTING SELECTED TEXT - Disclosed herein are methods for presenting speech from a selected text that is on a computing device. The method includes presenting text on a touch-sensitive display with the text size within a threshold level so that the computing device can accurately determine the intent of the user when the user touches the touch screen. Once the user's touch has been received, the computing device identifies and interprets the portion of text that is to be selected, and subsequently presents the text audibly to the user.03-18-2010
20100241432PROVIDING DESCRIPTIONS OF VISUALLY PRESENTED INFORMATION TO VIDEO TELECONFERENCE PARTICIPANTS WHO ARE NOT VIDEO-ENABLED - Descriptions of visually presented material are provided to one or more conference participants that do not have video capabilities. This presented material could be any one or more of a document, PowerPoint® presentation, spreadsheet, Webex® presentation, whiteboard, chalkboard, interactive whiteboard, description of a flowchart, picture, or in general, any information visually presented at a conference. For this visually presented information, descriptions thereof are assembled and forwarded via one or more of a message, SMS message, whisper channel, text information, non-video channel, MSRP, or the like, to one or more conference participant endpoints. These descriptions of visually presented information, such as a document, spreadsheet, spreadsheet presentation, multi-media presentation, or the like, can be assembled in cooperation with one or more of OCR recognition and text-to-speech conversion, human input, or the like.09-23-2010
20110071835SMALL FOOTPRINT TEXT-TO-SPEECH ENGINE - Embodiments of small footprint text-to-speech engine are disclosed. In operation, the small footprint text-to-speech engine generates a set of feature parameters for an input text. The set of feature parameters includes static feature parameters and delta feature parameters. The small footprint text-to-speech engine then derives a saw-tooth stochastic trajectory that represents the speech characteristics of the input text based on the static feature parameters and the delta parameters. Finally, the small footprint text-to-speech engine produces a smoothed trajectory from the saw-tooth stochastic trajectory, and generates synthesized speech based on the smoothed trajectory.03-24-2011
20120303371METHODS AND APPARATUS FOR ACOUSTIC DISAMBIGUATION - Techniques for disambiguating at least one text segment from at least one acoustically similar word and/or phrase. The techniques include identifying at least one text segment, in a textual representation having a plurality of text segments, having at least one acoustically similar word and/or phrase, annotating the textual representation with disambiguating information to help disambiguate the at least one text segment from the at least one acoustically similar word and/or phrase, and synthesizing a speech signal, at least in part, by performing text-to-speech synthesis on at least a portion of the textual representation that includes the at least one text segment, wherein the speech signal includes speech corresponding to the disambiguating information located proximate the portion of the speech signal corresponding to the at least one text segment.11-29-2012
20110060590SYNTHETIC SPEECH TEXT-INPUT DEVICE AND PROGRAM - A synthetic speech text-input device is provided that allows a user to intuitively know an amount of an input text that can be fit in a desired duration. A synthetic speech text-input device 03-10-2011
20130191130SPEECH SYNTHESIS METHOD AND APPARATUS FOR ELECTRONIC SYSTEM - A speech synthesis method for an electronic system and a speech synthesis apparatus are provided. In the speech synthesis method, a speech signal file including text content is received. The speech signal file is analyzed to obtain prosodic information of the speech signal file. The text content and the corresponding prosodic information are automatically tagged to obtain a text tag file. A speech synthesis file is obtained by synthesizing a human voice profile and the text tag file.07-25-2013
20120150543Personality-Based Device - A personality-based theme may be provided. An application program may query a personality resource file for a prompt corresponding to a personality. Then the prompt may be received at a speech synthesis engine. Next, the speech synthesis engine may query a personality voice font database for a voice font corresponding to the personality. Then the speech synthesis engine may apply the voice font to the prompt. The voice font applied prompt may then be produced at an output device.06-14-2012
20120065979METHOD AND SYSTEM FOR TEXT TO SPEECH CONVERSION - A system and method for text to speech conversion. The method of performing text to speech conversion on a portable device includes: identifying a portion of text for conversion to speech format, wherein the identifying includes performing a prediction based on information associated with a user. While the portable device is connected to a power source, a text to speech conversion is performed on the portion of text to produce converted speech. The converted speech is stored into a memory device of the portable device. A reader application is executed, wherein a user request is received for narration of the portion of text. During the executing, the converted speech is accessed from the memory device and rendered to the user, responsive to the user request.03-15-2012
20120203554SYSTEMS AND METHODS FOR PROVIDING EMERGENCY INFORMATION - In one general aspect, emergency information for a person is received from a user. A unique identifier for the person is generated. The unique identifier is associated with the emergency information. The emergency information is stored on an emergency information device. The unique identifier is associated with the emergency information device. The emergency information device is sent to the user.08-09-2012
20090313021METHODS AND SYSTEMS FOR SIGHT IMPAIRED WIRELESS CAPABILITY - A method for sending data to a sight impaired user, the method comprising, receiving data from a data resource, determining whether the data is compatible with a Symbian API, transcoding the data into a first format compatible with the Symbian API, determining whether the data is compatible with a TALKS filter, transcoding the data into a second format compatible with the TALKS filter, determining whether the data is usable by a sight impaired user, transcoding the data into a third format usable by a sight impaired user responsive to determining that the data is not usable by a sight impaired user, converting a data type definition associated with the data into a format compatible with a user profile, sending the received data to a user mobile device, wherein the mobile device is operative to convert the data into an audible output.12-17-2009
20090240501AUTOMATICALLY GENERATING NEW WORDS FOR LETTER-TO-SOUND CONVERSION - Described is a technology by which artificial words are generated based on seed words, and then used with a letter-to-sound conversion model. To generate an artificial word, a stressed syllable of a seed word is replaced with a different syllable, such as a candidate (artificial) syllable, when the phonemic structure and/or graphonemic structure of the stressed syllable and the candidate syllable match one another. In one aspect, the artificial words are provided for use with a letter-to-sound conversion model, which may be used to generate artificial phonemes from a source of words, such as in conjunction with other models. If the phonemes provided by the various models for a selected source word are in agreement relative to one another, the selected source word and an associated artificial phoneme may be added to a training set which may then be used to retrain the letter-to-sound conversion model.09-24-2009
20090006098Text-to-speech apparatus - According to an aspect of an embodiment, an apparatus for converting text data into a sound signal comprises: a phoneme determiner for determining phoneme data corresponding to a plurality of phonemes and pause data corresponding to a plurality of pauses to be inserted among a series of phonemes in the text data to be converted into a sound signal; a phoneme length adjuster for modifying the phoneme data and the pause data by determining lengths of the phonemes, respectively, in accordance with a speed of the sound signal and selectively reducing the length of at least one of the pauses in the text data to a pause length which is less than the pause length corresponding to the speed of the sound signal; and an output unit for outputting a sound signal on the basis of the phoneme data and pause data adjusted by the phoneme length adjuster.01-01-2009
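The pause-shortening idea in 20090006098 can be illustrated with a short sketch: phoneme durations are scaled to the requested speech rate, while pauses are additionally capped below their rate-scaled length. The function name, the tuple representation, and the 200 ms cap are all illustrative assumptions, not taken from the patent.

```python
def adjust_lengths(units, rate, max_pause_ms=200):
    """Scale phoneme lengths by 1/rate; selectively cap pauses.

    `units` is a list of (kind, base_ms) tuples where kind is
    'phoneme' or 'pause'; `rate` > 1 means faster speech.
    """
    out = []
    for kind, base_ms in units:
        length = base_ms / rate                 # rate-scaled length
        if kind == "pause":
            length = min(length, max_pause_ms)  # reduce below the rate-scaled pause
        out.append((kind, round(length)))
    return out
```

At rate 1.0, a 400 ms pause is shortened to the 200 ms cap while a 100 ms phoneme keeps its length, which matches the patent's notion of reducing a pause below the length that the speech rate alone would dictate.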
20110054903RICH CONTEXT MODELING FOR TEXT-TO-SPEECH ENGINES - Embodiments of rich context modeling for speech synthesis are disclosed. In operation, a text-to-speech engine refines a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models. The text-to-speech engine then generates synthesized speech for an input text based at least on some of the plurality of refined rich context models.03-03-2011
20110264452AUDIO OUTPUT OF TEXT DATA USING SPEECH CONTROL COMMANDS - Example embodiments disclosed herein relate to audio output of speech data using speech control commands. In particular, example embodiments include a mechanism for accessing text data. Example embodiments may also include a mechanism for outputting the text data as audio by converting the text data to speech audio data and transmitting the speech audio data over an audio output. Example embodiments may also include a mechanism for receiving speech control commands that allow for voice control of the output of the audio data.10-27-2011
20110046955SPEECH PROCESSING APPARATUS, SPEECH PROCESSING METHOD AND PROGRAM - There is provided a speech processing apparatus including: a data obtaining unit which obtains music progression data defining a property of one or more time points or one or more time periods along progression of music; a determining unit which determines an output time point at which a speech is to be output during reproducing the music by utilizing the music progression data obtained by the data obtaining unit; and an audio output unit which outputs the speech at the output time point determined by the determining unit during reproducing the music.02-24-2011
20110035222SELECTING FROM A PLURALITY OF AUDIO CLIPS FOR ANNOUNCING MEDIA - Systems and methods for selecting one of several audio clips associated with a text item for playback are provided. The electronic device can determine which audio clip to play back at any point in time using different approaches, including for example receiving a user selection or randomly selecting audio clips. In some embodiments, the electronic device can intelligently select audio clips based on attributes of the media item, the electronic device operations, or the environment of the electronic device. The attributes can include, for example, metadata values of the media item, the type of ongoing operations of the electronic device, and environmental characteristics that can be measured or detected using sensors of or coupled to the electronic device. Different audio clips can be associated with particular attribute values, such that an audio clip corresponding to the detected or received attribute values are played back.02-10-2011
20100324904SYSTEMS AND METHODS FOR MULTIPLE LANGUAGE DOCUMENT NARRATION - Disclosed are techniques and systems to provide a narration of a text in multiple different languages where the portions of the text narrated using the different voices associated with different languages are selected by a user.12-23-2010
20100324903SYSTEMS AND METHODS FOR DOCUMENT NARRATION WITH MULTIPLE CHARACTERS HAVING MULTIPLE MOODS - Disclosed are techniques and systems to provide a narration of a text in multiple different voices. Further disclosed are techniques and systems for providing a plurality of characters, at least some of which have multiple associated moods, for use in document narration.12-23-2010
20120310649SWITCHING BETWEEN TEXT DATA AND AUDIO DATA BASED ON A MAPPING - Techniques are provided for creating a mapping that maps locations in audio data (e.g., an audio book) to corresponding locations in text data (e.g., an e-book). Techniques are provided for using a mapping between audio data and text data, whether the mapping is created automatically or manually. A mapping may be used for bookmark switching where a bookmark established in one version of a digital work (e.g., e-book) is used to identify a corresponding location with another version of the digital work (e.g., an audio book). Alternatively, the mapping may be used to play audio that corresponds to text selected by a user. Alternatively, the mapping may be used to automatically highlight text in response to audio that corresponds to the text being played. Alternatively, the mapping may be used to determine where an annotation created in one media context (e.g., audio) will be consumed in another media context.12-06-2012
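The bookmark switching described in 20120310649 amounts to a lookup over a mapping of anchor pairs: a position in the e-book is resolved to the nearest preceding anchor, whose audio timestamp becomes the playback position. A minimal sketch, assuming a mapping of (text offset, audio time) anchors; the anchor values and function name are hypothetical.

```python
import bisect

# Each anchor pairs a text offset (characters) with an audio timestamp (seconds).
anchors = [
    (0, 0.0),
    (120, 9.5),
    (310, 24.0),
    (560, 41.2),
]

def text_to_audio(text_offset):
    """Map an e-book bookmark to an audio-book timestamp via the
    nearest anchor at or before the bookmark."""
    offsets = [a[0] for a in anchors]
    i = bisect.bisect_right(offsets, text_offset) - 1
    return anchors[max(i, 0)][1]
```

The same table, searched on the audio column, supports the reverse direction (highlighting text as audio plays), which is why the patent treats one mapping as serving both media contexts.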
20100223058SPEECH SYNTHESIS DEVICE, SPEECH SYNTHESIS METHOD, AND SPEECH SYNTHESIS PROGRAM - A speech synthesis device includes a pitch pattern generation unit (09-02-2010
20100082347SYSTEMS AND METHODS FOR CONCATENATION OF WORDS IN TEXT TO SPEECH SYNTHESIS - Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets. A text string may be normalized and its native language determined for obtaining a target phoneme for providing human-sounding speech in a language (e.g., dialect or accent) that is familiar to a user. The algorithms may be implemented on a system including several dedicated render engines. The system may be part of a back end coupled to a front end including storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech. The front end may communicate media assets and associated synthesized speech content over a network to host devices coupled to portable electronic devices on which the media assets and synthesized speech are played back.04-01-2010
20100082350METHOD AND SYSTEM FOR PROVIDING SYNTHESIZED SPEECH - An approach providing the efficient use of speech synthesis in rendering text content as audio in a communications network. The communications network can include a telephony network and a data network in support of, for example, Voice over Internet Protocol (VoIP) services. A speech synthesis system receives a text string from either a telephony network or a data network. The speech synthesis system determines whether a rendered audio file of the text string is stored in a database and renders the text string to output the rendered audio file if the rendered audio is determined not to exist. The rendered audio file is stored in the database for re-use according to a hash value generated by the speech synthesis system based on the text string.04-01-2010
20100082348SYSTEMS AND METHODS FOR TEXT NORMALIZATION FOR TEXT TO SPEECH SYNTHESIS - Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets. A text string may be normalized and its native language determined for obtaining a target phoneme for providing human-sounding speech in a language (e.g., dialect or accent) that is familiar to a user. The algorithms may be implemented on a system including several dedicated render engines. The system may be part of a back end coupled to a front end including storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech. The front end may communicate media assets and associated synthesized speech content over a network to host devices coupled to portable electronic devices on which the media assets and synthesized speech are played back.04-01-2010
20100082346SYSTEMS AND METHODS FOR TEXT TO SPEECH SYNTHESIS - Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets. A text string may be normalized and its native language determined for obtaining a target phoneme for providing human-sounding speech in a language (e.g., dialect or accent) that is familiar to a user. The algorithms may be implemented on a system including several dedicated render engines. The system may be part of a back end coupled to a front end including storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech. The front end may communicate media assets and associated synthesized speech content over a network to host devices coupled to portable electronic devices on which the media assets and synthesized speech are played back.04-01-2010
20100057466METHOD AND APPARATUS FOR SCROLLING TEXT DISPLAY OF VOICE CALL OR MESSAGE DURING VIDEO DISPLAY SESSION - A method and communication device disclosed includes displaying a video on a display, converting voice audio data to textual data by applying voice-to-text conversion, and displaying the textual data as scrolling text displayed along with the video on the display and either above, below or across the video. The method may further include receiving a voice call indication from a network, providing the voice call indication to a user interface where the voice call indication corresponds to an incoming voice call; and receiving a user input for receiving the voice call and displaying the voice call as scrolling text. In another embodiment, a method includes displaying application related data on a display; converting voice audio data to textual data by applying voice-to-text conversion; converting the textual data to a video format; and displaying the textual data as scrolling text over the application related data on said display.03-04-2010
20090299746METHOD AND SYSTEM FOR SPEECH SYNTHESIS - A method for performing speech synthesis to a textual content at a client. The method includes the steps of: performing speech synthesis to the textual content based on a current acoustical unit set S12-03-2009
20110137655SPEECH SYNTHESIS SYSTEM - A speech synthesis system includes a server device and a client device. The server device stores speech element information and speech element identification information in association with each other so that, in a case that speech element information representing respective speech elements included in speech uttered by a speech registering user are arranged in the order of arrangement of the speech elements in the speech, at least one of speech element identification information identifying the respective speech element information has different information from information arranged in accordance with a predetermined rule. The client device transmits speech element identification information to the server device based on accepted text information. The client device executes a speech synthesis process based on the speech element information received from the server device.06-09-2011
20110153330SYSTEM AND METHOD FOR RENDERING TEXT SYNCHRONIZED AUDIO - One or more computing devices include software- and/or hardware-implemented processing units that synchronize a textual content with an audio content, where the textual content is made up of a sequence of textual units and the audio content is made up of a sequence of sound units. The system and/or method matches each of the sequence of sound units with a corresponding textual unit. The system and/or method determines a corresponding time of occurrence for each sound unit in the audio content relative to a time reference. Each matched textual unit is then associated with a tag that corresponds to the time of occurrence for the sound unit matched with the textual unit.06-23-2011
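The tagging step in 20110153330 can be sketched as pairing each textual unit with a sound unit and its time of occurrence. The sketch assumes the units already correspond one-to-one; establishing that correspondence is the matching work the real system performs, and all names here are illustrative.

```python
def tag_words(words, sound_units):
    """Associate each textual unit with the time of occurrence of
    its matched sound unit.

    `sound_units` is a list of (recognized_word, start_seconds),
    assumed to be in the same order as `words`.
    """
    tags = []
    for word, (sound, start) in zip(words, sound_units):
        assert word == sound  # the matching step, trivialized here
        tags.append({"word": word, "t": start})
    return tags
```

The resulting per-word timestamps are what later enable features such as highlighting text in step with audio playback.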
20100030561ANNOTATING PHONEMES AND ACCENTS FOR TEXT-TO-SPEECH SYSTEM - A system that outputs phonemes and accents of texts. The system has a storage section storing a first corpus in which spellings, phonemes, and accents of a text input beforehand are recorded separately for individual segmentations of the words that are contained in the text. A text for which phonemes and accents are to be output is acquired and the first corpus is searched to retrieve at least one set of spellings that match the spellings in the text from among sets of contiguous spellings. Then, the combination of a phoneme and an accent that has a higher probability of occurrence in the first corpus than a predetermined reference probability is selected as the phonemes and accent of the text.02-04-2010
20090048843SYSTEM-EFFECTED TEXT ANNOTATION FOR EXPRESSIVE PROSODY IN SPEECH SYNTHESIS AND RECOGNITION - The inventive system can automatically annotate the relationship of text and acoustic units for the purposes of: (a) predicting how the text is to be pronounced as expressively synthesized speech, and (b) improving the proportion of expressively uttered speech as correctly identified text representing the speaker's message. The system can automatically annotate text corpora for relationships of uttered speech for a particular speaking style and for acoustic units in terms of context and content of the text to the utterances. The inventive system can use kinesthetically defined expressive speech production phonetics that are recognizable and controllable according to kinesensic feedback principles. In speech synthesis embodiments of the invention, the text annotations can specify how the text is to be expressively pronounced as synthesized speech. Also, acoustically-identifying features for dialects or mispronunciations can be identified so as to expressively synthesize alternative dialects or stylistic mispronunciations for a speaker from a given text. In speech recognition embodiments of the invention, each text annotation can be uniquely identified from the corresponding acoustic features of a unit of uttered speech to correctly identify the corresponding text. By employing a method of rules-based text annotation, the invention enables expressiveness to be altered to reflect syntactic, semantic, and/or discourse circumstances found in text to be synthesized or in an uttered message.02-19-2009
20080288256REDUCING RECORDING TIME WHEN CONSTRUCTING A CONCATENATIVE TTS VOICE USING A REDUCED SCRIPT AND PRE-RECORDED SPEECH ASSETS - The present invention discloses a system and a method for creating a reduced script, which is read by a voice talent to create a concatenative text-to-speech (TTS) voice. The method can automatically process pre-recorded audio to derive speech assets for a concatenative TTS voice. The pre-recorded audio can include sets of recorded phrases used by a speech user interface (SUI). A set of unfulfilled speech assets needed for full phonetic coverage of the concatenative TTS voice can be determined. A reduced script can be constructed that includes a set of phrases, which when read by a voice talent result in a reduced corpus. When the reduced corpus is automatically processed, a reduced set of speech assets results. The reduced set includes each of the unfulfilled speech assets. When this reduced corpus is combined with existing speech assets, the result will be a voice with a complete set of speech assets.11-20-2008
20080235024METHOD AND SYSTEM FOR TEXT-TO-SPEECH SYNTHESIS WITH PERSONALIZED VOICE - A method and system are provided for text-to-speech synthesis with personalized voice. The method includes receiving an incidental audio input (09-25-2008
20120041765ELECTRONIC BOOK READER AND TEXT TO SPEECH CONVERTING METHOD - An electronic book reader includes a text obtaining module, a text analysis module, a speech synthesis module, a control module, and an audio output device. The text obtaining module is used for obtaining a selected segment of a text. The text analysis module is used for analyzing a time phrase of the selected segment to obtain a waiting time period according to the meaning of the time phrase in the selected segment. The speech synthesis module is used for converting the selected segment into speech. The control module is used for sending the content of the selected segment to the speech synthesis module, and waits for the waiting time period after sending the time phrase to the speech synthesis module. The audio output device is used for playing the speech.02-16-2012
20090037179Method and Apparatus for Automatically Converting Voice - The invention proposes a method and apparatus for significantly improving the quality of voice morphing and guaranteeing the similarity of converted voice. The invention sets several standard speakers in a TTS database, and selects the voices of different standard speakers for speech synthesis according to different roles, wherein the voice of the selected standard speaker is similar to the original role to a certain extent. Then the invention further performs voice morphing on the standard voice similar to the original voice to a certain extent, in order to accurately mimic the voice of the original speaker, so as to make the converted voice closer to the original voice features while guaranteeing the similarity.02-05-2009
20100057464SYSTEM AND METHOD FOR VARIABLE TEXT-TO-SPEECH WITH MINIMIZED DISTRACTION TO OPERATOR OF AN AUTOMOTIVE VEHICLE - A text-to-speech (TTS) system implemented in an automotive vehicle is dynamically tuned to increase intelligibility over a wide variety of vehicle operating states and environmental conditions by tuning characteristics of the synthesized voice in response to measured operating states. To decrease distractions to an operator of the vehicle, an embodiment of the invention prevents updates to the synthesized voice character from taking effect while a message phrase is being played. Instead, voice characteristics are updated only during natural phrase breaks. In another embodiment of the invention, a damping filter is applied to calculated changes in voice characteristics to prevent excessively rapid changes from being applied, reducing the likelihood of distracting the vehicle operator. In another embodiment of the invention, both phrase-break detectors and damping filters are employed.03-04-2010
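The damping filter mentioned in 20100057464 can be sketched as a first-order smoothing step: at each phrase break, the voice characteristic moves only a fraction of the way toward its newly calculated target, so it never jumps abruptly. The update rule and the smoothing constant are illustrative assumptions, not taken from the patent.

```python
def damped_updates(targets, alpha=0.25, start=0.0):
    """Apply v += alpha * (target - v) once per phrase break.

    `targets` is the sequence of calculated characteristic values
    (e.g. a volume level), one per phrase break; `alpha` in (0, 1]
    controls how quickly the voice tracks them.
    """
    v, out = start, []
    for t in targets:
        v = v + alpha * (t - v)   # damped step toward the target
        out.append(round(v, 3))
    return out
```

With `alpha=0.25`, a sudden jump in the target is spread over several phrase breaks instead of being applied at once, which is the distraction-reducing behavior the abstract describes.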
20120046949METHOD AND APPARATUS FOR GENERATING AND DISTRIBUTING A HYBRID VOICE RECORDING DERIVED FROM VOCAL ATTRIBUTES OF A REFERENCE VOICE AND A SUBJECT VOICE - A first person narrates a selected written text to generate a reference audio file. One or more parameters are selected from the sounds of the reference audio file, including the duration of a sound, the duration of a pause, the rise and fall of frequency relative to a reference frequency, and/or the volume differential between select sounds. A voice profile library contains a phonetic library of sounds spoken by a subject speaker. An integration module generates a preliminary audio file of the selected text in the voice of the subject speaker and then modifies individual sounds by the parameters from the reference file, forming a hybrid audio file. The hybrid audio file retains the tonality of the subject voice, but incorporates the rhythm, cadence and inflections of the reference voice. The reference audio file and/or the hybrid audio file are licensed or sold as part of a commercial transaction.02-23-2012
20120046948METHOD AND APPARATUS FOR GENERATING AND DISTRIBUTING CUSTOM VOICE RECORDINGS OF PRINTED TEXT - A speech analysis module compares a subject text to the voice of a subject person reciting the text, and generates a personal voice library of the subject's voice. The library includes audio files of actual words spoken by the subject person, as well as morphological, syntactical and grammatical considerations affecting the pronunciation of words and pauses. Words not actually spoken by the subject can be artificially synthesized by an analysis of the subject's speech and pronunciation, and utilizing sounds and portions of words spoken by the subject. Upon request for an audio recording of an object text in the voice of the subject, an integration module retrieves discrete audio files from the personal voice library and artificially generates a voice recording of the object text in the voice of the subject. The generation and transmission of custom audio files can be part of a commercial transaction.02-23-2012
20090063153SYSTEM AND METHOD FOR BLENDING SYNTHETIC VOICES - A system and method for generating a synthetic text-to-speech TTS voice are disclosed. A user is presented with at least one TTS voice and at least one voice characteristic. A new synthetic TTS voice is generated by blending a plurality of existing TTS voices according to the selected voice characteristics. The blending of voices involves interpolating segmented parameters of each TTS voice. Segmented parameters may be, for example, prosodic characteristics of the speech such as pitch, volume, phone durations, accents, stress, mis-pronunciations and emotion.03-05-2009
20120010888Method and System for Speech Synthesis and Advertising Service - Methods and systems for providing a network-accessible text-to-speech synthesis service are provided. The service accepts content as input. After extracting textual content from the input content, the service transforms the content into a format suitable for high-quality speech synthesis. Additionally, the service produces audible advertisements, which are combined with the synthesized speech. The audible advertisements themselves can be generated from textual advertisement content.01-12-2012
20110166861METHOD AND APPARATUS FOR SYNTHESIZING A SPEECH WITH INFORMATION - According to one embodiment, an apparatus for synthesizing a speech, comprises an inputting unit configured to input a text sentence, a text analysis unit configured to analyze the text sentence so as to extract linguistic information, a parameter generation unit configured to generate a speech parameter by using the linguistic information and a pre-trained statistical parameter model, an embedding unit configured to embed information into the speech parameter, and a speech synthesis unit configured to synthesize the speech parameter with the information embedded by the embedding unit into a speech with the information.07-07-2011
20120016675BROADCAST SYSTEM USING TEXT TO SPEECH CONVERSION - A broadcast signal receiver comprises a text data receiver for receiving broadcast text data for display to a user in relation to a user interface; a text-to-speech (TTS) converter for converting received text data into an audio speech signal, the TTS converter being operable to detect whether a word for conversion is included in a stored list of words for conversion and, if so, to convert that word according to a conversion defined by the stored list; and if not, to convert that word according to a set of predetermined conversion rules; a conversion memory storing the list of words for conversion by the TTS converter; and an update receiver for receiving additional words and associated conversions for storage in the conversion memory.01-19-2012
20120158406FACILITATING TEXT-TO-SPEECH CONVERSION OF A USERNAME OR A NETWORK ADDRESS CONTAINING A USERNAME - To facilitate text-to-speech conversion of a username, a first or last name of a user associated with the username may be retrieved, and a pronunciation of the username may be determined based at least in part on whether the name forms at least part of the username. To facilitate text-to-speech conversion of a domain name having a top level domain and at least one other level domain, a pronunciation for the top level domain may be determined based at least in part upon whether the top level domain is one of a predetermined set of top level domains. Each other level domain may be searched for one or more recognized words therewithin, and a pronunciation of the other level domain may be determined based at least in part on an outcome of the search. The username and domain name may form part of a network address such as an email address, URL or URI.06-21-2012
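The username heuristic in 20120158406 can be sketched simply: if a known first or last name forms part of the username, pronounce that part as the name and spell out the remaining characters; otherwise spell the whole username. The rule set and names here are simplified assumptions, not the patent's actual method.

```python
def pronounce_username(username, names):
    """Return a crude pronunciation string: a recognized name is
    spoken as a word, leftover characters are spelled out."""
    for name in names:
        if name.lower() in username.lower():
            i = username.lower().index(name.lower())
            rest = username[:i] + username[i + len(name):]
            spelled = " ".join(rest)          # spell leftover characters
            return (name + " " + spelled).strip()
    return " ".join(username)                 # no name found: spell it all
```

For example, with the known surname "smith", the username "jsmith42" is rendered as the word "smith" plus the spelled-out leftovers, rather than being spelled letter by letter in full.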
20120109656AUDIO OUTPUT OF A DOCUMENT FROM MOBILE DEVICE - Architecture for playing a document converted into an audio format to a user of an audio-output capable device. The user can interact with the device to control play of the audio document, such as pause, rewind, forward, etc. In a more robust implementation, the audio-output capable device is a mobile device (e.g., cell phone) having a microphone for processing voice input. Voice commands can then be input to control play (“reading”) of the document audio file to pause, rewind, read paragraph, read next chapter, fast forward, etc. A communications server (e.g., email, attachments to email, etc.) transcodes text-based document content into an audio format by leveraging a text-to-speech (TTS) engine. The transcoded audio files are then transferred to mobile devices through viable transmission channels. Users can then play the audio-formatted document while freeing hand and eye usage for other tasks.05-03-2012
20120072224METHOD OF SPEECH SYNTHESIS - The present invention relates to a method of text-based speech synthesis, wherein at least one portion of a text is specified; the intonation of each portion is determined; target speech sounds are associated with each portion; physical parameters of the target speech sounds are determined; speech sounds most similar in terms of the physical parameters to the target speech sounds are found in a speech database; and speech is synthesized as a sequence of the found speech sounds. The physical parameters of said target speech sounds are determined in accordance with the determined intonation. The present method, when used in a speech synthesizer, allows improved quality of synthesized speech due to precise reproduction of intonation.03-22-2012
20120130718METHOD AND SYSTEM FOR COLLECTING AUDIO PROMPTS IN A DYNAMICALLY GENERATED VOICE APPLICATION - A prompt collecting tool (05-24-2012
20110106538SPEECH SYNTHESIS SYSTEM - This speech synthesis system includes a server device and a client device. The client device accepts text information representing text, and transmits a speech element request to the server device. The server device stores speech element information. The server device receives the speech element request transmitted by the client device and, in response to the received speech element request, transmits speech element information to the client device so that the speech element information is received by the client device in a different order from an order of arrangement of speech elements in speech corresponding to the text. The client device executes a speech synthesis process by rearranging the speech element information so that speech elements represented by the received speech element information are arranged in the same order as the order of arrangement of the speech elements in the speech corresponding to the text.05-05-2011
20110106537TRANSFORMING COMPONENTS OF A WEB PAGE TO VOICE PROMPTS - Embodiments of the invention address the deficiencies of the prior art by providing a method, apparatus, and program product for converting components of a web page into voice prompts for a user. In some embodiments, the method comprises selectively determining at least one HTML component from a plurality of HTML components of a web page to transform into a voice prompt for a mobile system based upon a voice attribute file associated with the web page. The method further comprises transforming the at least one HTML component into parameterized data suitable for use by the mobile system based upon at least a portion of the voice attribute file associated with the at least one HTML component and transmitting the parameterized data to the mobile system.05-05-2011
20100094632System and Method of Developing A TTS Voice - Disclosed herein are various aspects of a toolkit used for generating a TTS voice for use in a spoken dialog system. The embodiments in each case may be in the form of the system, a computer-readable medium or a method for generating the TTS voice. An embodiment of the invention relates to a method of tracking progress in developing a text-to-speech (TTS) voice. The method comprises ensuring that a corpus of recorded speech contains no reading errors and matches an associated written text, creating a tuple for each utterance in the corpus and tracking progress for each utterance utilizing the tuple. Various parameters may be tracked using the tuple, but the tuple provides a means for enabling multiple workers to efficiently process a database of utterances in preparation of a TTS voice.04-15-2010
20120123781TOUCH SCREEN DEVICE FOR ALLOWING BLIND PEOPLE TO OPERATE OBJECTS DISPLAYED THEREON AND OBJECT OPERATING METHOD IN THE TOUCH SCREEN DEVICE - A touch screen device allowing blind people to operate objects displayed thereon and an object operating method in the touch screen device are provided. The touch screen device includes a touch sensing unit generating key values corresponding to touched positions of a virtual keyboard for controlling application software being executed, the number of touches, and the touch time, and transmitting the key values to the application software when sensing touches of the virtual keyboard while the virtual keyboard is activated; an object determination unit reading text information of a focused object using a hooking mechanism when the application software is executed based on the key values received from the touch sensing unit and an object among objects included in the application software is focused; and a speech synthesis unit converting the text information read by the object determination unit into speech data using a text-to-speech engine and outputting the speech data.05-17-2012
20120166199HOSTED VOICE RECOGNITION SYSTEM FOR WIRELESS DEVICES - Methods, systems, and software for converting the audio input of a user of a hand-held client device or mobile phone into a textual representation by means of a backend server accessed by the device through a communications network. The text is then inserted into or used by an application of the client device to send a text message, instant message, email, or to insert a request into a web-based application or service. In one embodiment, the method includes the steps of initializing or launching the application on the device; recording and transmitting the recorded audio message from the client device to the backend server through a client-server communication protocol; converting the transmitted audio message into the textual representation in the backend server; and sending the converted text message back to the client device or forwarding it on to an alternate destination directly from the server.06-28-2012
20120166198CONTROLLABLE PROSODY RE-ESTIMATION SYSTEM AND METHOD AND COMPUTER PROGRAM PRODUCT THEREOF - In one embodiment of a controllable prosody re-estimation system, a TTS/STS engine consists of a prosody prediction/estimation module, a prosody re-estimation module and a speech synthesis module. The prosody prediction/estimation module generates predicted or estimated prosody information. And then the prosody re-estimation module re-estimates the predicted or estimated prosody information and produces new prosody information, according to a set of controllable parameters provided by a controllable prosody parameter interface. The new prosody information is provided to the speech synthesis module to produce a synthesized speech.06-28-2012
20100250253CONTEXT AWARE, SPEECH-CONTROLLED INTERFACE AND SYSTEM - A speech-directed user interface system includes at least one speaker for delivering an audio signal to a user and at least one microphone for capturing speech utterances of a user. An interface device interfaces with the speaker and microphone and provides a plurality of audio signals to the speaker to be heard by the user. A control circuit is operably coupled with the interface device and is configured for selecting at least one of the plurality of audio signals as a foreground audio signal for delivery to the user through the speaker. The control circuit is operable for recognizing speech utterances of a user and using the recognized speech utterances to control the selection of the foreground audio signal.09-30-2010
20100250254SPEECH SYNTHESIZING DEVICE, COMPUTER PROGRAM PRODUCT, AND METHOD - An acquiring unit acquires pattern sentences, which are similar to one another and include fixed segments and non-fixed segments, and substitution words that are substituted for the non-fixed segments. A sentence generating unit generates target sentences by replacing the non-fixed segments with the substitution words for each of the pattern sentences. A first synthetic-sound generating unit generates a first synthetic sound, a synthetic sound of the fixed segment, and a second synthetic-sound generating unit generates a second synthetic sound, a synthetic sound of the substitution word, for each of the target sentences. A calculating unit calculates a discontinuity value of a boundary between the first synthetic sound and the second synthetic sound for each of the target sentences and a selecting unit selects the target sentence having the smallest discontinuity value. A connecting unit connects the first synthetic sound and the second synthetic sound of the target sentence selected.09-30-2010
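The selection step described above can be illustrated with a toy sketch: compute a discontinuity value at the boundary between the fixed-segment sound and the substitution-word sound for each candidate target sentence, then keep the candidate with the smallest value. The cost function here (absolute difference of boundary features) is a stand-in, not the patent's metric.

```python
# Toy discontinuity-based selection over candidate target sentences.
# Each candidate carries one scalar "feature" for the end of the fixed
# segment and one for the start of the substitution segment (assumed).
def boundary_discontinuity(fixed_tail_feature, subst_head_feature):
    return abs(fixed_tail_feature - subst_head_feature)

def select_target_sentence(candidates):
    # candidates: list of (sentence, fixed_tail_feature, subst_head_feature)
    return min(candidates,
               key=lambda c: boundary_discontinuity(c[1], c[2]))[0]
```

A real system would compare spectral or prosodic frames at the join point rather than single scalars, but the argmin structure is the same.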
20120221340SCRIPTING SUPPORT FOR DATA IDENTIFIERS, VOICE RECOGNITION AND VOICE INPUT IN A TELNET SESSION - Methods of adding data identifiers and speech/voice recognition functionality are disclosed. A telnet client runs one or more scripts that add data identifiers to data fields in a telnet session. The input data is inserted in the corresponding fields based on data identifiers. Scripts run only on the telnet client without modifications to the server applications. Further disclosed are methods for providing speech recognition and voice functionality to telnet clients. Portions of input data are converted to voice and played to the user. A user also may provide input to certain fields of the telnet session by using his voice. Scripts running on the telnet client convert the user's voice into text, which is inserted into the corresponding fields.08-30-2012
20120221339METHOD, APPARATUS FOR SYNTHESIZING SPEECH AND ACOUSTIC MODEL TRAINING METHOD FOR SPEECH SYNTHESIS - According to one embodiment, a method, apparatus for synthesizing speech, and a method for training acoustic model used in speech synthesis is provided. The method for synthesizing speech may include determining data generated by text analysis as fuzzy heteronym data, performing fuzzy heteronym prediction on the fuzzy heteronym data to output a plurality of candidate pronunciations of the fuzzy heteronym data and probabilities thereof, generating fuzzy context feature labels based on the plurality of candidate pronunciations and probabilities thereof, determining model parameters for the fuzzy context feature labels based on acoustic model with fuzzy decision tree, generating speech parameters from the model parameters, and synthesizing the speech parameters via synthesizer as speech.08-30-2012
20120221338AUTOMATICALLY GENERATING AUDIBLE REPRESENTATIONS OF DATA CONTENT BASED ON USER PREFERENCES - A custom-content audible representation of selected data content is automatically created for a user. The content is based on content preferences of the user (e.g., one or more web browsing histories). The content is aggregated, converted using text-to-speech technology, and adapted to fit in a desired length selected for the personalized audible representation. The length of the audible representation may be custom for the user, and may be determined based on the amount of time the user is typically traveling.08-30-2012
20100082349SYSTEMS AND METHODS FOR SELECTIVE TEXT TO SPEECH SYNTHESIS - Algorithms for synthesizing speech used to identify media assets are provided. Speech may be selectively synthesized from text strings associated with media assets. A text string may be normalized and its native language determined for obtaining a target phoneme for providing human-sounding speech in a language (e.g., dialect or accent) that is familiar to a user. The algorithms may be implemented on a system including several dedicated render engines. The system may be part of a back end coupled to a front end including storage for media assets and associated synthesized speech, and a request processor for receiving and processing requests that result in providing the synthesized speech. The front end may communicate media assets and associated synthesized speech content over a network to host devices coupled to portable electronic devices on which the media assets and synthesized speech are played back.04-01-2010
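The front-end steps named in the abstract above (normalize a media-asset string, then determine its native language so a matching voice can be chosen) can be sketched as below. The normalization rules and the character-range language guesses are illustrative assumptions, not the patented algorithm.

```python
# Illustrative front end: normalize a title string, then guess its
# native language from Unicode character ranges (assumed heuristic).
import unicodedata

def normalize_title(text):
    """Fold compatibility characters, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.lower().split())

def guess_language(text):
    """Crude script-based guess; a real system would do far more."""
    if any("\u3040" <= ch <= "\u30ff" for ch in text):
        return "ja"  # Japanese kana
    if any("\u0400" <= ch <= "\u04ff" for ch in text):
        return "ru"  # Cyrillic
    return "en"      # default
```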
20120215540METHOD FOR CONVERTING CHARACTER TEXT MESSAGES TO AUDIO FILES WITH RESPECTIVE TITLES FOR THEIR SELECTION AND READING ALOUD WITH MOBILE DEVICES - The present invention relates to a method for selecting and downloading content from a content provider which is accessible via an IP/DNS/URL address to a mobile device, the content being any text information data, for converting the text information data to at least one audio message and for storing the at least one audio message as at least one audio file on the mobile device, wherein the at least one audio file is playable and discernable as a music file. Said method implemented on a mobile phone enables controlling and playing the audio messages as music files, for instance also in a car environment with a car kit enabling a control and a selection of one or more of said at least one audio files for playing from the mobile phone.08-23-2012
20120136664SYSTEM AND METHOD FOR CLOUD-BASED TEXT-TO-SPEECH WEB SERVICES - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating speech. One variation of the method is from a server side, and another variation of the method is from a client side. The server side method, as implemented by a network-based automatic speech processing system, includes first receiving, from a network client independent of knowledge of internal operations of the system, a request to generate a text-to-speech voice. The request can include speech samples, transcriptions of the speech samples, and metadata describing the speech samples. The system extracts sound units from the speech samples based on the transcriptions and generates an interactive demonstration of the text-to-speech voice based on the sound units, the transcriptions, and the metadata, wherein the interactive demonstration hides a back end processing implementation from the network client. The system provides access to the interactive demonstration to the network client.05-31-2012
20120136665ELECTRONIC DEVICE AND CONTROL METHOD THEREOF - Disclosed are an electronic device and a control method thereof. The electronic device includes a text-to-speech unit which converts a text into an audio signal; an audio output unit which outputs audio corresponding to the converted audio signal; and a controller which controls the audio output unit to re-output at least one audio whose output was not completed if there is at least one audio which was not completely output among a plurality of audios output by the audio output unit.05-31-2012
20110184738NAVIGATION AND ORIENTATION TOOLS FOR SPEECH SYNTHESIS - TTS is a well-known technology that has been used for decades in applications ranging from artificial call-center attendants to PC software that allows people with visual impairments or reading disabilities to listen to written works on a home computer. However, to date TTS has not been widely adopted by PC and mobile users for daily reading tasks such as reading emails, PDF and Word documents, website content, and books. The present invention offers a new user experience for operating TTS in day-to-day usage. More specifically, this invention describes a synchronization technique for following text being read by TTS engines, along with specific interfaces for touch pads and touch and multi-touch screens. This invention also describes the use of other input methods such as the touchpad, mouse, and keyboard.07-28-2011
20110184739COMMUNICATIONS SYSTEM PROVIDING AUTOMATIC TEXT-TO-SPEECH CONVERSION FEATURES AND RELATED METHODS - A communications system may include at least one mobile wireless communications device, and a wireless communications network for sending text messages thereto. More particularly, the at least one mobile wireless communications device may include a wireless transceiver and a controller for cooperating therewith for receiving text messages from the wireless communications network. It may further include a headset output connected to the controller. The controller may be for switching between a normal message mode and an audio message mode based upon a connection between the headset output and a headset. Moreover, when in the audio message mode, the controller may output at least one audio message including speech generated from at least one of the received text messages via the headset output.07-28-2011
20100262426Interactive speech synthesizer for enabling people who cannot talk but who are familiar with use of anonym moveable picture communication to autonomously communicate using verbal language - A method for enabling a person, who cannot talk but who is familiar with use of anonym moveable picture communication, to autonomously communicate speech sound automatically in a sequence. The method includes the steps of choosing a plurality of selected encoded tags, each having apparatus by which it can be identified by the person; providing an interactive speech synthesizer; arranging each of the plurality of selected encoded tags to be movable between a ready mode, wherein it is proximate to an associated tag reader, and a go mode, wherein it is in operative association with the associated tag reader; providing, in the go mode, for each tag reader to read data from its associated one of the selected plurality of encoded tags in the sequence to provide a series of coded signals; transmitting the series of coded signals to a microcontroller; causing the microcontroller to organize a sound file corresponding to the series of coded signals; and transmitting the sound file to an audio output device to convert the sound file automatically into the speech sound.10-14-2010
20100174544SYSTEM, METHOD AND END-USER DEVICE FOR VOCAL DELIVERY OF TEXTUAL DATA - System and method for receiving documents of different formats from external sources, analyzing the documents and transforming them into an internal format comprising tokens for effective browsing and referencing, communicating data volumes of transformed documents to a user device, browsing and vocalizing tokens from the documents to the user, receiving and processing verbal user commands pertaining to said vocalized tokens, retrieving documents pertaining to the user command and vocalizing the retrieved documents to said user.07-08-2010
20120253816TEXT-TO-SPEECH USER'S VOICE COOPERATIVE SERVER FOR INSTANT MESSAGING CLIENTS - A system and method to allow an author of an instant message to enable and control the production of audible speech to the recipient of the message. The voice of the author of the message is characterized into parameters compatible with a formative or articulative text-to-speech engine such that upon receipt, the receiving client device can generate audible speech signals from the message text according to the characterization of the author's voice. Alternatively, the author can store samples of his or her actual voice in a server so that, upon transmission of a message by the author to a recipient, the server extracts the samples needed only to synthesize the words in the text message, and delivers those to the receiving client device so that they are used by a client-side concatenative text-to-speech engine to generate audible speech signals having a close likeness to the actual voice of the author.10-04-2012
20120253814SYSTEM AND METHOD FOR WEB TEXT CONTENT AGGREGATION AND PRESENTATION - A system and method for aggregating text-based content and presenting the text-based content as spoken audio is described herein, where a server module retrieves and aggregates web content from web content providers that may include text-based web content that is then extracted, filtered and categorizes for a client module to retrieve and play as spoken audio.10-04-2012
20120173242SYSTEM AND METHOD FOR EXCHANGE OF SCRIBBLE DATA BETWEEN GSM DEVICES ALONG WITH VOICE - A method for transferring scribble data along with voice includes connecting at least two electronic devices through a GSM network, accumulating and down sampling the scribble coordinates, which are converted to a speech-like signal that is sent along with voice data packets simultaneously in the GSM network.07-05-2012
20120173241MULTI-LINGUAL TEXT-TO-SPEECH SYSTEM AND METHOD - A multi-lingual text-to-speech system and method processes a text to be synthesized via an acoustic-prosodic model selection module and an acoustic-prosodic model mergence module, and obtains a phonetic unit transformation table. In an online phase, the acoustic-prosodic model selection module, according to the text and a phonetic unit transcription corresponding to the text, uses at least a set controllable accent weighting parameter to select a transformation combination and find a second and a first acoustic-prosodic models. The acoustic-prosodic model mergence module merges the two acoustic-prosodic models into a merged acoustic-prosodic model, according to the at least a controllable accent weighting parameter, processes all transformations in the transformation combination and generates a merged acoustic-prosodic model sequence. A speech synthesizer and the merged acoustic-prosodic model sequence are further applied to synthesize the text into an L1-accent L2 speech.07-05-2012
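The model mergence in the abstract above can be illustrated as a weighted interpolation under a controllable accent weight w in [0, 1] (w = 1 keeps the first acoustic-prosodic model, w = 0 the second). Real models hold distributions over many parameters; this minimal sketch reduces each model to a dict of scalar parameter means, which is an assumption for illustration only.

```python
# Minimal sketch: merge two acoustic-prosodic models by linear
# interpolation of per-parameter means under accent weight w (assumed
# simplification of the patent's mergence step).
def merge_models(model_l1, model_l2, w):
    assert 0.0 <= w <= 1.0, "accent weight must lie in [0, 1]"
    return {k: w * model_l1[k] + (1.0 - w) * model_l2[k] for k in model_l1}
```

With w = 0.5 the merged parameters sit halfway between the two models, giving a speech output with an intermediate degree of L1 accent.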
20120316881SPEECH SYNTHESIZER, SPEECH SYNTHESIS METHOD, AND SPEECH SYNTHESIS PROGRAM - A normalized spectrum storage unit 12-13-2012
20120179468Automatic Dominant Orientation Estimation In Text Images Based On Steerable Filters - Briefly, in accordance with one or more embodiments, an image processing system is capable of receiving an image containing text, applying optical character recognition to the image, and then audibly reproducing the text via text-to-speech synthesis. Prior to optical character recognition, an orientation corrector is capable of detecting an amount of angular rotation of the text in the image with respect to horizontal, and then rotating the image by an appropriate amount to sufficiently align the text with respect to horizontal for optimal optical character recognition. The detection may be performed using steerable filters to provide an energy versus orientation curve of the image data. A maximum of the energy curve may indicate the amount of angular rotation that may be corrected by the orientation corrector.07-12-2012
20090089061Audio Reader Device - An audio reader device for reading printed infrared media includes a linear sensor device sensitive to infra-red. A processor is operatively connected to the sensor device and is configured to read and decode infra-red audio data on the media. A memory is operatively connected to the processor for storing the audio data. A sound processing integrated circuit and speaker arrangement is operatively connected to the memory for playback of the audio data. A roller arrangement feeds the media past the linear sensor device.04-02-2009
20100010816FACILITATING TEXT-TO-SPEECH CONVERSION OF A USERNAME OR A NETWORK ADDRESS CONTAINING A USERNAME - To facilitate text-to-speech conversion of a username, a first or last name of a user associated with the username may be retrieved, and a pronunciation of the username may be determined based at least in part on whether the name forms at least part of the username. To facilitate text-to-speech conversion of a domain name having a top level domain and at least one other level domain, a pronunciation for the top level domain may be determined based at least in part upon whether the top level domain is one of a predetermined set of top level domains. Each other level domain may be searched for one or more recognized words therewithin, and a pronunciation of the other level domain may be determined based at least in part on an outcome of the search. The username and domain name may form part of a network address such as an email address, URL or URI.01-14-2010
20090306986Method and system for providing speech synthesis on user terminals over a communications network - Service architecture for providing to a user terminal of a communications network textual information and relative speech synthesis, the user terminal being provided with a speech synthesis engine and a basic database of speech waveforms includes: a content server for downloading textual information requested by means of a browser application on the user terminal; a context manager for extracting context information from the textual information requested by the user terminal; a context selector for selecting an incremental database of speech waveforms associated with extracted context information and for downloading the incremental database into the user terminal; a database manager on the user terminal for managing the composition of an enlarged database of speech waveforms for the speech synthesis engine including the basic and the incremental databases of speech waveforms.12-10-2009
20090018838Media interface - Provided are a user interface for processing digital data, a method for processing a media interface, and a recording medium thereof. The user interface is used for converting a selected script into voice to generate digital data having a form of a voice file corresponding to the script, or for managing the generated digital data. In the method, the user interface is displayed. The user interface includes at least a text window on which a script to be converted into voice is written, and an icon to be selected for converting the script written on the text window into voice.01-15-2009
20080300882Methods and Apparatus for Conveying Synthetic Speech Style from a Text-to-Speech System - A technique for producing speech output in a text-to-speech system is provided. A message is created for communication to a user in a natural language generator of the text-to-speech system. The message is annotated in the natural language generator with a synthetic speech output style. The message is conveyed to the user through a speech synthesis system in communication with the natural language generator, wherein the message is conveyed in accordance with the synthetic speech output style.12-04-2008
20120239405SYSTEM AND METHOD FOR GENERATING AUDIO CONTENT - A system and method for generating audio content. Content is automatically retrieved from a website. The content is converted to audio files. The audio files are associated with a hierarchy. The hierarchy is determined from the website. One or more audio files are communicated to an electronic device utilized by a user in response to a request from the user.09-20-2012
20120239404APPARATUS AND METHOD FOR EDITING SPEECH SYNTHESIS, AND COMPUTER READABLE MEDIUM - An acquisition unit analyzes a text, and acquires phonemic and prosodic information. An editing unit edits a part of the phonemic and prosodic information. A speech synthesis unit converts the phonemic and prosodic information before editing the part to a first speech waveform, and converts the phonemic and prosodic information after editing the part to a second speech waveform. A period calculation unit calculates a contrast period corresponding to the part in the first speech waveform and the second speech waveform. A speech generation unit generates an output waveform by connecting a first partial waveform and a second partial waveform. The first partial waveform contains the contrast period of the first speech waveform. The second partial waveform contains the contrast period of the second speech waveform.09-20-2012
20120265533VOICE ASSIGNMENT FOR TEXT-TO-SPEECH OUTPUT - Text can be obtained at a device from various forms of communication such as e-mails or text messages. Metadata can be obtained directly from the communication or from a secondary source identified by the directly obtained metadata. The metadata can be used to create a speaker profile. The speaker profile can be used to select voice data. The selected voice data can be used by a text-to-speech (TTS) engine to produce speech output having voice characteristics that best match the speaker profile.10-18-2012
20120265532System For Natural Language Assessment of Relative Color Quality - Embodiments of the invention include a system for providing a natural language objective assessment of relative color quality between a reference and a source image. The system may include a color converter that receives a difference measurement between the reference image and source image and determines a color attribute change based on the difference measurement. The color attributes may include hue shift, saturation changes, and color variation, for instance. Additionally, a magnitude index facility determines a magnitude of the determined color attribute change. Further, a natural language selector maps the color attribute change and the magnitude of the change to natural language and generates a report of the color attribute change and the magnitude of the color attribute change. The output can then be communicated to a user in either text or audio form, or in both text and audio forms.10-18-2012
20110046956System And Method For Improved Dynamic Allocation Of Application Resources - A self-help application platform, such as one hosting an interactive voice response (IVR) system, has a browser that executes application scripts to implement the self-help application. The execution of the application scripts is performed by utilizing various application resources, such as media conversions from text to speech (TTS) and speech to text (automatic speech recognition, ASR) and other media servers. The platform is provided with a dynamic resource selection mechanism in which the application is executed with an updated optimum set of application resources distributed over different locations. The selection is based on the profiles of the browser, users, route, and quality of service. The selection is further modulated by the browser's previous experiences with the individual resources. The selection is made dynamically during the execution of the application script.02-24-2011
20120323578Text-to-Speech Device and Text-to-Speech Method - A sound control section (12-20-2012
20110238420METHOD AND APPARATUS FOR EDITING SPEECH, AND METHOD FOR SYNTHESIZING SPEECH - According to one embodiment, a method for editing speech is disclosed. The method can generate speech information from a text. The speech information includes phonologic information and prosody information. The method can divide the speech information into a plurality of speech units, based on at least one of the phonologic information and the prosody information. The method can search at least two speech units from the plurality of speech units. At least one of the phonologic information and the prosody information in the at least two speech units are identical or similar. In addition, the method can store a speech unit waveform corresponding to one of the at least two speech units as a representative speech unit into a memory.09-29-2011
20120278081TEXT TO SPEECH METHOD AND SYSTEM - A text-to-speech method for use in a plurality of languages, including: inputting text in a selected language; dividing the inputted text into a sequence of acoustic units; converting the sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein the model has a plurality of model parameters describing probability distributions which relate an acoustic unit to a speech vector; and outputting the sequence of speech vectors as audio in the selected language. A parameter of a predetermined type of each probability distribution in the selected language is expressed as a weighted sum of language independent parameters of the same type. The weighting used is language dependent, such that converting the sequence of acoustic units to a sequence of speech vectors includes retrieving the language dependent weights for the selected language.11-01-2012
20120278082COMBINING WEB BROWSER AND AUDIO PLAYER FUNCTIONALITY TO FACILITATE ORGANIZATION AND CONSUMPTION OF WEB DOCUMENTS - The invention is directed to combining web browser and audio player functionality for the organization and consumption of web documents. Specifically, the invention identifies a set of web documents via a web browser, extracts content from the web documents, and adds the set of web documents to a playlist. In this way, users can build a playlist of web documents and utilize the functionality and convenience of an audio player and listen to the content of the playlist.11-01-2012
20120089402SPEECH SYNTHESIZER, SPEECH SYNTHESIZING METHOD AND PROGRAM PRODUCT - According to one embodiment, a speech synthesizer includes an analyzer, a first estimator, a selector, a generator, a second estimator, and a synthesizer. The analyzer analyzes text and extracts a linguistic feature. The first estimator selects a first prosody model adapted to the linguistic feature and estimates prosody information that maximizes a first likelihood representing probability of the selected first prosody model. The selector selects speech units that minimize a cost function determined in accordance with the prosody information. The generator generates a second prosody model that is a model of the prosody information of the speech units. The second estimator estimates prosody information that maximizes a third likelihood calculated on the basis of the first likelihood and a second likelihood representing probability of the second prosody model. The synthesizer generates synthetic speech by concatenating the speech units on the basis of the prosody information estimated by the second estimator.04-12-2012
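The second estimation step described above, where prosody is re-estimated against a likelihood combining the first and second prosody models, can be illustrated as scoring each candidate by a combination of two log-likelihoods and keeping the maximizer. The additive interpolation and its weight are assumptions; the abstract does not specify how the likelihoods are combined.

```python
# Sketch: re-estimate prosody by maximizing a combined score of the
# first-model and second-model log-likelihoods (combination assumed).
def combined_score(loglik1, loglik2, weight=1.0):
    return loglik1 + weight * loglik2

def reestimate(candidates, weight=1.0):
    # candidates: list of (prosody, loglik_model1, loglik_model2)
    return max(candidates,
               key=lambda c: combined_score(c[1], c[2], weight))[0]
```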
20120089401METHODS AND APPARATUS TO AUDIBLY PROVIDE MESSAGES IN A MOBILE DEVICE - Methods and apparatus to audibly provide messages in a mobile device are described. An example method includes receiving a message at a mobile device, wherein the message includes an identification of a sender, an identification of a recipient, and message contents; determining that the message contents include a predetermined phrase; and, in response to determining that the message contents include the predetermined phrase, audibly presenting the message contents.04-12-2012
20120089400SYSTEMS AND METHODS FOR USING HOMOPHONE LEXICONS IN ENGLISH TEXT-TO-SPEECH - The present invention relates to information systems. More specifically, the present invention relates to infrastructure and techniques for improving Text-to-Speech-enabled applications.04-12-2012
20110276332SPEECH PROCESSING METHOD AND APPARATUS - A speech synthesis method comprising: 11-10-2011
20120330665PRESCRIPTION LABEL READER - A system is configured to read a prescription label and output audio information corresponding to prescription information present on or linked to the prescription label. The system may have knowledge about prescription labels and prescription information, and use this knowledge to present the audio information in a structured form to the user.12-27-2012
20120330666METHOD, SYSTEM AND PROCESSOR-READABLE MEDIA FOR AUTOMATICALLY VOCALIZING USER PRE-SELECTED SPORTING EVENT SCORES - A method and system for vocalizing user-selected sporting event scores. A customized spoken score application module can be configured in association with a device. A real-time score can be preselected by a user from an existing sporting event website for automatically vocalizing the score in a multitude of languages utilizing a speech synthesizer and a translation engine. An existing text-to-speech engine can be integrated with the spoken score application module and controlled by the application module to automatically vocalize the preselected scores listed on the sporting event site. The synthetically-voiced, real-time score can be transmitted to the device at a predetermined time interval. Such an approach automatically and instantly pushes the real time vocal alerts thereby permitting the user to continue multitasking without activating the pre-selected vocal alerts.12-27-2012
20120330667SPEECH SYNTHESIZER, NAVIGATION APPARATUS AND SPEECH SYNTHESIZING METHOD - Included in a speech synthesizer, a natural language processing unit divides text data, input from a text input unit, into a plurality of components (particularly, words). An importance prediction unit estimates an importance level of each component according to the degree of how much each component contributes to understanding when a listener hears synthesized speech. Then, the speech synthesizer determines a processing load based on the device state when executing synthesis processing and the importance level. Included in the speech synthesizer, a synthesizing control unit and a wave generation unit reduce the processing time for a phoneme with a low importance level by curtailing its processing load (relatively degrading its sound quality), allocate a part of the processing time, made available by this reduction, to the processing time of a phoneme with a high importance level, and generates synthesized speech in which important words are easily audible.12-27-2012
20120330668AUTOMATED METHOD AND SYSTEM FOR OBTAINING USER-SELECTED REAL-TIME INFORMATION ON A MOBILE COMMUNICATION DEVICE - A customized live tile application module can be configured in association with the mobile communication device in order to automatically vocalize the real-time information preselected by a user in a multitude of languages. A text-to-speech application module can be integrated with the customized live tile application module to automatically vocalize the preselected real-time information. The real-time information can be obtained from a tile and/or a website integrated with a remote server and announced after a text to speech conversion process without opening the tile, if the tiles are selected for announcement of information by the device. Such an approach automatically and instantly pushes a vocal alert with respect to the user-selected real-time information on the mobile communication device thereby permitting the user to continue multitasking. Information from tiles can also be rendered on second screens from a mobile device.12-27-2012
20110320207CODING, MODIFICATION AND SYNTHESIS OF SPEECH SEGMENTS - The invention relates to a method for speech signal analysis, modification and synthesis comprising a phase for the location of analysis windows by means of an iterative process for the determination of the phase of the first sinusoidal component and comparison between the phase value of said component and a predetermined value, a phase for the selection of analysis frames corresponding to an allophone and readjustment of the duration and the fundamental frequency according to certain thresholds and a phase for the generation of synthetic speech from synthesis frames taking the information of the closest analysis frame as spectral information of the synthesis frame and taking as many synthesis frames as periods that the synthetic signal has. The method allows a coherent location of the analysis windows within the periods of the signal and the exact generation of the synthesis instants in a manner synchronous with the fundamental period.12-29-2011
20110320206ELECTRONIC BOOK READER AND TEXT TO SPEECH CONVERTING METHOD - An electronic book reader includes a text obtaining module, a text highlighting module, a speech synthesis module, a player module, and a synchronization control module. The text obtaining module obtains a selected segment of a text. The text highlighting module highlights the selected segment. The speech synthesis module converts the selected segment into a speech. The player module plays the speech. The synchronization control module sends the selected segment to the text highlighting module and speech synthesis module synchronously.12-29-2011
20110320205ELECTRONIC BOOK READER - An electronic book reader includes a display, an audio output device, a text obtaining module, a storing module, a text displaying module, a text analyzing module, a text highlighting module, a speech synthesis module, a player module, and a synchronization control module. The text obtaining module obtains a text from a text source. The storing module stores the text. The text displaying module displays the text on the display. The text analyzing module divides the text into a plurality of segments according to punctuations of the text, and reads a selected segment. The speech synthesis module converts the selected segment into speech. The synchronization control module sends a command to the text analyzing module for reading the segment, and sends the segment to the text highlighting module and speech synthesis module synchronously.12-29-2011
20110320204SYSTEMS AND METHODS FOR INPUT DEVICE AUDIO FEEDBACK - Systems, methods, apparatuses and computer program products configured to provide sound feedback for input devices are described. Embodiments take input from a digitizer, such as input using as stylus/pen, and produce sound feedback to enhance the user's input interface experience. Embodiments thus provide a user with a more realistic interface with an electronic device, emulating use of conventional writing implements.12-29-2011
20120290304Electronic Holder for Reading Books - A book support and optical scanner assembly for converting printed text to an audio output includes a support for supporting an open book and a pair of optical scanners adapted to scan opposite pages. The assembly also includes means for moving the scanners from the top of the page to the bottom of a page. Further, both scanners can be rotated off of the book for turning a page. In addition, the assembly includes a text to audio converter for converting the scanned text into spoken words and in one embodiment a translator to translate the scanned text into a pre-selected language.11-15-2012
20100198599DISPLAY APPARATUS - In a display apparatus, a text code input section outputs externally-supplied text code information to a font conversion section and a voice synthesizer section. The font conversion section converts the input text code into a corresponding font, and transmits the font to a display drive section via a video signal input section, and the display drive section causes a display section to display the font. Meanwhile, the voice synthesizer section converts the input text code into corresponding voice data, and transmits the voice data to a voice device where the voice data is outputted. With this structure, superior convenience is ensured for a display apparatus which serves only as an individual displaying apparatus and relies on an external device (server) for the major functions of the system.08-05-2010
20090076821METHOD AND APPARATUS TO CONTROL OPERATION OF A PLAYBACK DEVICE - Media metadata is accessible for a plurality of media items (See FIG. 03-19-2009
20100174545INFORMATION PROCESSING APPARATUS AND TEXT-TO-SPEECH METHOD - An information processing apparatus for playing back data includes an oral reading unit, a storage unit storing text templates for responses to questions from a user and text template conversion rules, an input unit for inputting a question from a user, and a control unit for retrieving data and items of information associated with the data. The control unit analyzes a question about data from a user, for example, a question about a tune, to select a text template for a response to the question and detects the characters in items of tune information of the tune. The characters are designated to replace replacement symbols included in the text template. The control unit also converts the text template based on whether the characters can be read aloud, generates a text to be read aloud using the converted text template, and causes the oral reading unit to read the text aloud.07-08-2010
20100169096Instant communication with instant text data and voice data - Embodiments of the invention relate to an instant communication method, an instant communication server, a speech server and a system thereof. The instant communication method includes: receiving, by a speech server, text data sent via instant communication software by a first user terminal; transforming, by the speech server, the text data into first speech data; sending, by the speech server, the first speech data via a preconfigured phone number to a corresponding second user terminal; receiving, by the speech server, second speech data sent by the second user terminal; and sending, by the speech server, the second speech data to the first user terminal via the instant communication software. Using embodiments of the invention, website owners can communicate with visitors via a mobile phone or a fixed telephone anytime and anywhere, which may improve the reception of Internet marketing, reduce prerequisite requirements for e-commerce, and connect the Internet and the telecommunication network.07-01-2010
20130013314MOBILE COMPUTING APPARATUS AND METHOD OF REDUCING USER WORKLOAD IN RELATION TO OPERATION OF A MOBILE COMPUTING APPARATUS - A mobile computing apparatus comprises a processing resource arranged to support, when in use, an operational environment, the operational environment supporting receipt of textual content, a workload estimator arranged to estimate a cognitive workload for a user, and a text-to-speech engine. The text-to-speech engine is arranged to translate at least part of the received textual content to a signal reproducible as audible speech in accordance with a predetermined relationship between the amount of the textual content to be translated and a cognitive workload level in a range of cognitive workload levels, the range of cognitive workload levels comprising at least one cognitive workload level between end values.01-10-2013
20130013313STATISTICAL ENHANCEMENT OF SPEECH OUTPUT FROM A STATISTICAL TEXT-TO-SPEECH SYNTHESIS SYSTEM - A method, system and computer program product are provided for enhancement of speech synthesized by a statistical text-to-speech (TTS) system employing a parametric representation of speech in a space of acoustic feature vectors. The method includes: defining a parametric family of corrective transformations operating in the space of the acoustic feature vectors and dependent on a set of enhancing parameters; and defining a distortion indicator of a feature vector or a plurality of feature vectors. The method further includes: receiving a feature vector output by the system; and generating an instance of the corrective transformation by: calculating a reference value of the distortion indicator attributed to a statistical model of the phonetic unit emitting the feature vector; calculating an actual value of the distortion indicator attributed to feature vectors emitted by the statistical model of the phonetic unit emitting the feature vector; calculating the enhancing parameter values depending on the reference value of the distortion indicator, the actual value of the distortion indicator and the parametric corrective transformation; and deriving an instance of the corrective transformation corresponding to the enhancing parameter values from the parametric family of the corrective transformations. The instance of the corrective transformation may be applied to the feature vector to provide an enhanced feature vector.01-10-2013
20130018658DYNAMICALLY EXTENDING THE SPEECH PROMPTS OF A MULTIMODAL APPLICATION - A prompt generation engine operates to dynamically extend prompts of a multimodal application. The prompt generation engine receives a media file having a metadata container. The prompt generation engine operates on a multimodal device that supports a voice mode and a non-voice mode for interacting with the multimodal device. The prompt generation engine retrieves from the metadata container a speech prompt related to content stored in the media file for inclusion in the multimodal application. The prompt generation engine modifies the multimodal application to include the speech prompt.01-17-2013
20110161085METHOD AND APPARATUS FOR AUDIO SUMMARY OF ACTIVITY FOR USER - Techniques for audio summary of activity for a user include tracking activity at one or more network sources associated with a user. One audio stream that summarizes the activity over a particular time period is generated. The audio stream is caused to be delivered to a particular device associated with the user. A duration of a complete rendering of the audio stream is shorter than the particular time period. In some embodiments, a link to content related to at least a portion of the audio stream is also caused to be delivered for a user.06-30-2011
20080235025PROSODY MODIFICATION DEVICE, PROSODY MODIFICATION METHOD, AND RECORDING MEDIUM STORING PROSODY MODIFICATION PROGRAM - A prosody modification device includes: a real voice prosody input part that receives real voice prosody information extracted from an utterance of a human; a regular prosody generating part that generates regular prosody information having a regular phoneme boundary that determines a boundary between phonemes and a regular phoneme length of a phoneme by using data representing a regular or statistical phoneme length in an utterance of a human with respect to a section including at least a phoneme or a phoneme string to be modified in the real voice prosody information; and a real voice prosody modification part that resets a real voice phoneme boundary by using the generated regular prosody information so that the real voice phoneme boundary and a real voice phoneme length of the phoneme or the phoneme string to be modified in the real voice prosody information are approximate to an actual phoneme boundary and an actual phoneme length of the utterance of the human, thereby modifying the real voice prosody information.09-25-2008
20080221894SYNTHESIZING SPEECH FROM TEXT - Speech is synthesized for a given text by determining a sequence of phonetic components based on the text, determining a sequence of target phonetic elements associated phonetic components, determining a sequence of target event types associated with the phonetic components and determining a sequence of speech units from a plurality of stored speech unit candidates by use of a cost function. The cost function comprises a unit cost, a concatenation cost, and an event type cost for each speech unit in the sequence of speech units. The unit cost of a speech unit is determined with respect to the corresponding target phonetic element, while the concatenation cost of a speech unit is determined with respect to adjacent speech units and the event type cost of each speech unit is determined with respect to the corresponding target event type.09-11-2008
20130144624SYSTEM AND METHOD FOR LOW-LATENCY WEB-BASED TEXT-TO-SPEECH WITHOUT PLUGINS - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio and responds to the browser with the audio file. The system saves the audio file in a cache, with the file indexed by a unique identifier. As the system continues converting text into speech, when identical text appears the system uses the cached audio corresponding to the identical text without the need for re-synthesis via the TTS server.06-06-2013
20130179170CROWD-SOURCING PRONUNCIATION CORRECTIONS IN TEXT-TO-SPEECH ENGINES - Technologies are described herein for providing validated text-to-speech correction hints from aggregated pronunciation corrections received from text-to-speech applications. A number of pronunciation corrections are received by a Web service. The pronunciation corrections may be provided by users of text-to-speech applications executing on a variety of user computer systems. Each of the plurality of pronunciation corrections includes a specification of a word or phrase and a suggested pronunciation provided by the user. The pronunciation corrections are analyzed to generate validated correction hints, and the validated correction hints are provided back to the text-to-speech applications to be used to correct pronunciation of words and phrases in the text-to-speech applications.07-11-2013
20120253815TALKING PAPER AUTHORING TOOLS - A range of unified software authoring tools for creating a talking paper application for integration in an end user platform are described herein. The authoring tools are easy to use and are interoperable to provide an easy and cost-effective method of creating a talking paper application. The authoring tools provide a framework for creating audio content and image content and interactively linking the audio content and the image content. The authoring tools also provide for verifying the interactively linked audio and image content, reviewing the audio content, the image content and the interactive linking on a display device. Finally, the authoring tools provide for saving the audio content, the video content and the interactive linking for publication to a manufacturer for integration in an end user platform or talking paper platform.10-04-2012
20130096920FACILITATING TEXT-TO-SPEECH CONVERSION OF A USERNAME OR A NETWORK ADDRESS CONTAINING A USERNAME - To facilitate text-to-speech conversion of a username, a first or last name of a user associated with the username may be retrieved, and a pronunciation of the username may be determined based at least in part on whether the name forms at least part of the username. To facilitate text-to-speech conversion of a domain name having a top level domain and at least one other level domain, a pronunciation for the top level domain may be determined based at least in part upon whether the top level domain is one of a predetermined set of top level domains. Each other level domain may be searched for one or more recognized words therewithin, and a pronunciation of the other level domain may be determined based at least in part on an outcome of the search. The username and domain name may form part of a network address such as an email address, URL or URI.04-18-2013
20130096921INFORMATION PROVIDING SYSTEM AND VEHICLE-MOUNTED APPARATUS - A portable terminal apparatus is configured to obtain provided information including character data from an information distribution server apparatus, transmit partial data, which is a portion of the character data, to a voice synthesizing server apparatus, and obtain voice data obtained by converting the partial data into voice from the voice synthesizing server apparatus, and when a predetermined notification is received from a vehicle-mounted apparatus, a command is given to cause the vehicle-mounted apparatus to display the provided information corresponding to the voice data, and the vehicle-mounted apparatus displays information given by the portable terminal apparatus, plays the voice data, and when selection operation performed by a user is received, the portable terminal apparatus is notified that the selection operation has been performed.04-18-2013
20110313772SYSTEM AND METHOD FOR UNIT SELECTION TEXT-TO-SPEECH USING A MODIFIED VITERBI APPROACH - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for speech synthesis. A system practicing the method receives a set of ordered lists of speech units, for each respective speech unit in each ordered list in the set of ordered lists, constructs a sublist of speech units from a next ordered list which are suitable for concatenation, performs a cost analysis of paths through the set of ordered lists of speech units based on the sublist of speech units for each respective speech unit, and synthesizes speech using a lowest cost path of speech units through the set of ordered lists based on the cost analysis. The ordered lists can be ordered based on the respective pitch of each speech unit. In one embodiment, speech units which do not have an assigned pitch can be assigned a pitch.12-22-2011
20130132087AUDIO INTERFACE - Methods, systems, and apparatus are generally described for providing an audio interface.05-23-2013
20130144625SYSTEMS AND METHODS DOCUMENT NARRATION - Disclosed are techniques and systems to provide a narration of a text in multiple different voices. In some aspects, systems and methods described herein can include receiving a user-based selection of a first portion of words in a document where the document has a pre-associated first voice model and overwriting the association of the first voice model, by the one or more computers, with a second voice model for the first portion of words.06-06-2013
20110218809VOICE SYNTHESIS DEVICE, NAVIGATION DEVICE HAVING THE SAME, AND METHOD FOR SYNTHESIZING VOICE MESSAGE - A voice synthesis device includes: a memory for storing a plurality of recorded voice data; a dividing unit for dividing a text into a plurality of words or phrases, wherein the text is to be converted into a voice message; a verifying unit for verifying whether one of the recorded voice data corresponding to each word or phrase is disposed in the memory; and a voice synthesizing unit for preparing a whole of the text with the recorded voice data when all of the recorded voice data corresponding to all of the plurality of words or phrases are disposed in the memory, and for preparing the whole of the text with rule-based synthesized voice data when at least one of the recorded voice data corresponding to one of the plurality of words or phrases is not disposed in the memory.09-08-2011
20080201149VARIABLE VOICE RATE APPARATUS AND VARIABLE VOICE RATE METHOD - A variable voice rate apparatus to control a reproduction rate of voice, includes a voice data generation unit configured to generate voice data from the voice, a text data generation unit configured to generate text data indicating a content of the voice data, a division information generation unit configured to generate division information used for dividing the text data into a plurality of linguistic units each of which is characterized by a linguistic form, a reproduction information generation unit configured to generate reproduction information set for each of the linguistic units, and a voice reproduction controller which controls reproduction of each of the linguistic units, based on the reproduction information and the division information.08-21-2008
20110238421Speech Output Device, Control Method For A Speech Output Device, Printing Device, And Interface Board - A speech output device, a control method for a speech output device, a printer, and an interface board can improve the productivity of foreign language speaking workers in industries such as retailing and food services. A data communication unit 09-29-2011
20100299149Character Models for Document Narration - Disclosed are techniques and systems to provide a narration of a text in multiple different voices where the portions of the text narrated using the different voices are selected by a user. Also disclosed are techniques and systems for associating characters with portions of a sequence of words selected by a user. Different characters having different voice models can be associated with different portions of a sequence of words.11-25-2010
20130204623ELECTRONIC APPARATUS AND FUNCTION GUIDE METHOD THEREOF - In an electronic apparatus having a plurality of functions, a connecting unit connects the electronic apparatus to an external device which presents text information in a form recognizable by a visually impaired user. A function selection unit selects a function to be executed. A storage unit stores a table defining correspondence between the plurality of functions and a plurality of text files each containing text information. A text file selection unit selects a text file corresponding to the selected function with reference to the table. An acquisition unit acquires file information from the selected text file. A transmission unit transmits the acquired file information to the external device.08-08-2013
20130204624CONTEXTUAL CONVERSION PLATFORM FOR GENERATING PRIORITIZED REPLACEMENT TEXT FOR SPOKEN CONTENT OUTPUT - A contextual conversion platform, and method for converting text-to-speech, are described that can convert content of a target to spoken content. Embodiments of the contextual conversion platform can identify certain contextual characteristics of the content, from which can be generated a spoken content input. This spoken content input can include tokens, e.g., words and abbreviations, to be converted to the spoken content, as well as substitution tokens that are selected from contextual repositories based on the context identified by the contextual conversion platform.08-08-2013
20100318363SYSTEMS AND METHODS FOR PROCESSING INDICIA FOR DOCUMENT NARRATION - Disclosed are techniques and systems to provide a narration of a text in multiple different voices. Further disclosed are techniques and systems for processing indicia in a document to determine a portion of words and associating a particular voice model with the portion of words based on the indicia. During a readback process, an audible output corresponding to the words in the portion of words is generated using the voice model associated with the portion of words.12-16-2010
20100318362Systems and Methods for Multiple Voice Document Narration - Disclosed are techniques and systems to provide a narration of a text in multiple different voices where the portions of the text narrated using the different voices are selected by a user.12-16-2010
20100318361Context-Relevant Images - Assistive, context-relevant images may be provided. First, text may be received. Then a spell check indication may be received and a spelling check may be performed on the received text in response to the received spell check indication. Next, in response to the performed spelling check, a misspelling indication may be provided configured to indicate that at least one word in the received text is misspelled. A selection of the misspelling indication may then be received. Then, on a display device in response to the received selection of the misspelling indication, a plurality of suggested spellings for the at least one word and an image corresponding to a first one of the plurality of suggested spellings for the at least one word may be displayed.12-16-2010
20100318360METHOD AND SYSTEM FOR EXTRACTING MESSAGES - The present invention is a method and system for extracting messages from a person using the body features presented by a user. The present invention captures a set of images and extracts a first set of body features, along with a set of contexts, and a set of meanings. From the first set of body features, the set of contexts, and the set of meanings, the present invention generates a set of words corresponding to the message that the person is attempting to convey. The present invention can also use the body features of the person in addition to the voice of the person to further improve the accuracy of extracting the person's message.12-16-2010
20120284028METHODS AND APPARATUS TO PRESENT A VIDEO PROGRAM TO A VISUALLY IMPAIRED PERSON - Methods and apparatus to present a video program to a visually impaired person are disclosed. An example method comprises detecting a text portion of a media stream including a video stream, the text portion not being consumable by a blind person, retrieving text associated with the text portion of the media stream, and converting the text to a first audio stream based on a first type of a first program in the media stream, and converting the text to a second audio stream based on a second type of a second program in the media stream.11-08-2012
20090313020TEXT-TO-SPEECH USER INTERFACE CONTROL - A system and method include detecting computer readable text associated with a device, detecting a starting point for a text-to-speech conversion of text, beginning the text-to-speech conversion upon detection of movement of a pointing device in a direction of text flow, and controlling a rate of the text-to-speech conversion based on a rate of movement of the pointing device in relation to the text to be converted.12-17-2009
20120029920Cooperative Processing For Portable Reading Machine - A handheld device includes an image input device capable of acquiring images, circuitry to send a representation of the image to a remote computing system that performs at least one processing function related to processing the image and circuitry to receive from the remote computing system data based on processing the image by the remote system.02-02-2012
20080319754Text-to-speech apparatus - According to an aspect of an embodiment, an apparatus for converting text data into sound signal, comprises: a phoneme determiner for determining phoneme data corresponding to a plurality of phonemes and pause data corresponding to a plurality of pauses to be inserted among a series of phonemes in the text data to be converted into sound signal; a phoneme length adjuster for modifying the phoneme data and the pause data by determining lengths of the phonemes, respectively in accordance with a speed of the sound signal and selectively adjusting, the length of at least one of the phonemes which is a fricative in the text data so that the at least one of the fricative phonemes is relatively extended timewise as compared to other phonemes; and an output unit for outputting sound signal on the basis of the adjusted phoneme data and pause data by the phoneme length adjuster.12-25-2008
20120046947Assisted Reader - An electronic reading device for reading ebooks and other digital media items combines a touch surface electronic reading device with accessibility technology to provide a visually impaired user more control over his or her reading experience. In some implementations, the reading device can be configured to operate in at least two modes: a continuous reading mode and an enhanced reading mode.02-23-2012

Patent applications in class Image to speech