Patent application title: Integrated echo canceller and speech codec for voice-over IP (VoIP)
Inventors:
Kishan Shenoi (Saratoga, CA, US)
IPC8 Class: AH04J310FI
USPC Class:
370201
Class name: Multiplex communications crosstalk suppression
Publication date: 2011-09-29
Patent application number: 20110235500
Abstract:
A method includes operating an integrated echo canceller and speech codec
for voice-over internet protocol. An apparatus includes an echo canceller
and a speech codec, wherein the speech codec includes a decoder and an
encoder, and wherein the echo canceller and the speech codec are
integrated for voice-over-internet protocol.
Claims:
1. A method, comprising operating an integrated echo canceller and speech
codec for voice-over internet protocol.
2. The method of claim 1, wherein operating includes providing a power of a received signal, {x(n)} to the echo canceller by the speech codec.
3. The method of claim 2, wherein when the speech codec has introduced a comfort noise to fill in for silence, the power of the comfort noise is provided to the echo canceller by the speech codec.
4. The method of claim 1, wherein operating includes providing a power of a transmit-in signal {y(n)} to the speech codec by the echo canceller.
5. A computer program, comprising computer or machine readable program elements translatable for implementing the method of claim 1.
6. A machine readable medium, comprising a program for performing the method of claim 1.
7. An apparatus, comprising: an echo canceller and a speech codec, wherein the speech codec includes a decoder and an encoder, and wherein the echo canceller and the speech codec are integrated for voice-over-internet protocol.
8. The apparatus of claim 7, wherein a power of a received signal, {x(n)} is provided to the echo canceller from the decoder.
9. The apparatus of claim 8, wherein when the decoder has introduced a comfort noise to fill in for silence, the power of the comfort noise is provided to the echo canceller from the decoder.
10. The apparatus of claim 7, wherein a power of a transmit-in signal {y(n)} is provided to the speech codec from the echo canceller.
11. The apparatus of claim 7, wherein the echo canceller includes a power calculator coupled to a non-linear processor.
12. The apparatus of claim 7, wherein the speech codec includes both a silence detector and an allocator coupled to the encoder.
13. The apparatus of claim 7, further comprising an analog interface coupled to the integrated echo canceller and speech codec for voice-over-internet protocol.
14. The apparatus of claim 7, further comprising both a receive jitter buffer and a transmitter buffer coupled to the integrated echo canceller and speech codec for voice-over-internet protocol.
15. A digital switched network integrated access device, comprising the apparatus of claim 11.
16. The method of claim 4, wherein the echo canceller introduces comfort noise to fill in for non-linearly-processed echo residual and the power of the comfort noise is provided to the speech codec by the echo canceller.
17. The method of claim 1, wherein the echo canceller introduces comfort noise to fill in for non-linearly-processed echo residual and the echo canceller flags the segment of speech for the speech codec to treat as silence.
Description:
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application claims a benefit of priority under 35 U.S.C. 119(e) from copending provisional patent applications U.S. Ser. No. 61/340,923, filed Mar. 24, 2010, U.S. Ser. No. 61/340,922, filed Mar. 24, 2010 and U.S. Ser. No. 61/340,906, filed Mar. 24, 2010, the entire contents of all of which are hereby expressly incorporated herein by reference for all purposes.
BACKGROUND INFORMATION
[0002] 1. Field of the Invention
[0003] Embodiments of the invention relate generally to the field of digital networking communications. More particularly, an embodiment of the invention relates to methods and systems for packet switched networking that include an integrated echo canceller and speech codec for voice-over-IP (VoIP).
[0004] 2. Discussion of the Related Art
[0005] The modern Internet has its roots in ARPANET, a network used primarily by academic institutions to link computers. An outgrowth of the original communication protocols, Internet Protocol (IP) is the predominant choice for Layer-3 protocol suites in modern networks and is particularly appropriate for data communication, involving file transfers and other "non-real-time" applications. The Internet, however, is being considered for a variety of applications, including, but not restricted to, real-time applications such as voice communication (VoIP, or Voice over IP). This multiplicity of services, with different needs, is being addressed by protocol enhancements such as DiffServ (for differentiated services), whereby packet streams are identified and processed according to specific needs. Some services, such as file transfer, can tolerate longer time delays and larger time-delay variations than other services, such as VoIP that demand shorter delays and small time-delay variation. Such service attributes are collectively referred to by the term Quality of Service (abbreviated QoS). In particular, a small time-delay variation is associated with a high QoS and a large time-delay variation associated with a low QoS. Whereas QoS is a generic term and is the amalgamation of various service attributes, the most prevalent interpretation of the term is related to control processes that attempt to address these issues.
[0006] More relevant for speech communication is the notion of Quality of Experience or QoE. The perceived satisfaction, or dissatisfaction, with a voice call is inherently subjective. The goal of a network is to provide toll-quality. Simply put, toll-quality represents the QoE of a telephone call over a very "clean" connection, a call that a preponderance of human beings would be quite satisfied with. The ITU-T has developed a metric, called the R-value, for defining toll-quality and an excellent description is provided in Ref. [1]. The R-value associated with a "perfect" call, one with zero transmission delay, all parameters optimally tuned, uncompressed speech (G.711 coding, see Ref. [2c]), and no signal drop-outs, is about 94. An R-value greater than about 80 is considered toll-quality. An R-value below 65 is representative of the condition that many users are dissatisfied (below PSTN-quality as expressed in Ref. [1]) and below 50 is not recommended (60 and below is considered very poor in Ref. [1]), indicating that almost all users would be unhappy with the QoE. As described in Ref. [1], and references therein, the degradation in quality associated with any parameter (such as delay, distortion, packet-loss, etc.) can be quantified as a drop in R-value.
[0007] A network based on packet switching and transmission can be quite complex, but the simple model depicted in FIG. 1 is sufficient to illustrate how signal processing plays a role. We consider an IAD (Integrated Access Device) at the customer premise as the traffic aggregator. All the various services are provided from the IAD to which all the customer equipment is connected. To allow for attachment of legacy devices such as telephones and facsimile ("Fax") machines, the IAD will provide an FXS (the terms FXS and FXO are common in telephony and are explained shortly) port to which the telephone (Fax machine) is connected. To the telephone (Fax machine), the FXS port appears, for all intents and purposes, as the line circuit of a traditional Class-5 switch. The IAD contains the codec where the conversion between analog and digital is accomplished. The information, however, is not transported as a conventional DS0 ("Digital Signal level 0") (64 kbit/s bearer channel) as would be the case in a TDM (time division multiplexed) or circuit-switched scenario. The data is packetized and encapsulated in the appropriate "wrappers" for transmission over the packet network.
[0008] In terms of the important processes involved after call set-up, a simple, though accurate, view is depicted in FIG. 2. For convenience only one direction of transmission is shown. The analog signal from the source ("srce") is converted into digital format using an A/D converter. It is quite conventional to use a telephony codec that uses a sampling rate of 8 kHz and encodes the sample value in an octet (G.711 coding), though there are implementations described in the literature where a higher sampling rate and a higher word-length are used for improved fidelity. These samples are assembled into packets. For speech applications there may be some signal processing involved for purposes of echo cancellation and data compression. The packets are delivered to the destination by the packet network. At the destination the information is extracted from the packets and the requisite signal processing is performed to regenerate the speech signal, which is played out via the digital-to-analog converter (DAC). The jitter buffer function indicated allows for time-delay variation (packet-delay variation) in the network, whereby it is permissible to have different transit delays for different packets provided the variation is less than the extent of the jitter buffer.
[0009] The interface to the user device can be quite varied. For example, the end-user may choose to use a conventional telephone and in this case the so-called "FXS" interface is appropriate. If the end-user equipment is a PBX (Private Branch Exchange) then the natural interface is the so-called "FXO" interface. To the telephone set the IAD FXS interface looks like a switch; to a switch the IAD FXO interface looks like a telephone.
[0010] In FIG. 3 the essential parts of an FXS interface are depicted. The drop-side interface (the end-user VF equipment) in this situation is a regular telephone set. The "SLIC" function (for Subscriber Line Interface Circuit, terminology that has its origins in the development of switches) represents circuitry that provides battery voltage (to power the telephone), performs the supervision function (decides whether the telephone is "on-hook", i.e., not in use, or "off-hook", i.e., in use), provides ringing voltage to alert the end-user to an incoming call (the ringer arrangement is not shown in FIG. 3), and performs a 2-wire-to-4-wire or "hybrid" function (the 2-wire interface carries signals for both directions of transmission; 4-wire implies that the directions of transmission are separate). The codec function involves the conversion of the (analog) speech from the telephone to digital format for transmission to the distant end as well as conversion of the digital signal representative of speech from the distant end into an analog form for transmission to the local telephone set. Conventionally, the conversion rate is 8 kHz and the digital format is 8 bits/sample (G.711) (μ-law in North America, A-law in Europe and most of the rest of the world) for a net bit rate of 64 kbit/s (equivalent to a DS0).
[0011] In FIG. 4 the essential parts of an FXO interface are depicted. The functionality is akin to a reflection of the FXS. The drop-side interface (the end-user VF equipment) in this situation is a switch, such as a PBX. The "DAA" function (for Data Access Arrangement, terminology that has its origins in the development of voice-band modems) represents circuitry that accepts battery voltage from the switch. The DAA performs the reverse supervision function (mimicking a telephone that is "on-hook", i.e., not in use, or "off-hook", i.e., in use) by presenting an open circuit (on-hook) or a closed circuit, drawing current from the switch (off-hook); detects ringing voltage to ascertain that the switch is alerting the end-user to an incoming call; and performs a 2-wire-to-4-wire or "hybrid" function (the 2-wire interface carries signals for both directions of transmission; 4-wire implies that the directions of transmission are separate). The codec function involves the conversion of the (analog) speech from the switch to digital format for transmission to the distant end as well as conversion of the digital signal representative of speech from the distant end into an analog form for transmission to the switch. As in the FXS interface, the conversion rate is 8 kHz and the digital format is 8 bits/sample (G.711) (μ-law in North America, A-law in Europe and most of the rest of the world) for a net bit rate of 64 kbps (equivalent to a DS0).
[0012] In a TDM environment the supervision state of the telephone (on-hook or off-hook) is detected by the FXS interface and encoded in a signaling bit for transmission to the distant end where the FXO mimics an on-hook/off-hook telephone to the switch. The alerting state of the switch (ringing present or not present) is detected by the FXO interface and encoded in a signaling bit for transmission to the distant end where the FXS mimics a switch by applying (or not) ringing voltage to the telephone set connected to the IAD. It is not uncommon for the FXS unit to terminate application of ringing voltage when it determines the telephone has gone off-hook without waiting for the "gone-off-hook" transition to be transported to the distant end switch, which will terminate ringing at that point in time when it detects that off-hook condition and cause the FXO to send a "ringing-not-present" status condition. In an IP environment the same functionality is provided but is achieved by messages. Protocols such as SIP and H.323 serve the purpose of defining the procedures and methods for delivering such supervisory information.
[0013] It should be clear that in the cases described above, the VF Interface, from the view of the signal processing block within the IAD, looks like a digitized signal, with each individual channel having its own status indicator bits (the A/B signaling bits) and the level of digitization format follows G.711, corresponding to a sampling rate of 8 kHz and 8-bits-per-sample encoding law (μ-law or A-law). In the subsequent discussion it will be assumed that this is the "generic" VF interface from the viewpoint of the signal processing or "line-side" functions of the IAD.
[0014] When dealing with speech or speech-type signals the common unit of power used in the telecommunication literature is "dBm0". This is a logarithmic unit (decibel) and has been adopted to deal with signals that are in digital format. It is often necessary to talk about the "power" of the signal even though the notion of power in physical terms is associated with the rate of dissipation of energy in a resistive element, a concept that is "analog" in nature. A relationship between numerical values characteristic of digital signals and power values (in watts, or milliwatts, or dBm) characteristic of analog signals is built via the definition of a "reference signal" and "reference power level". By definition, the "reference signal" is a 1 kHz tone (often referred to as a "test-tone"). In the digitized format, considering the sampling rate is 8 kHz, a 1 kHz signal can be specified using just eight numbers (there are 8 samples in each cycle of the 1 kHz tone) and fewer distinct values, using the symmetry properties of the sinusoidal waveform. A particular pattern has been agreed upon and, by definition, the power of the 1 kHz tone with amplitude values satisfying this pattern is 0 dBm0. The corresponding tone is the test-tone and the digital signal with this pattern of sample values is referred to as the digital milliwatt.
[0015] When the digital test-tone is converted into analog, the power of the resulting (analog) 1 kHz signal, say X dBm, is called the (receive) reference level or (receive) transmission level power (TLP). A particular analog 1 kHz signal of power, say, Y dBm will, upon conversion to digital, provide a sequence of numerical sample values that match (from a power standpoint) a digital milliwatt. Then Y dBm will be the value for the (transmit) reference level or (transmit) TLP. Whereas it is common for the transmit TLP and receive TLP to be numerically the same, there are numerous exceptions.
[0016] The peak values associated with the codec are described in terms of the power of a sine-wave whose amplitude corresponds to the maximum value that can be represented by the code. In the case of μ-law, this maximum value or "crash-point" is +3.17 dBm0; for the A-law codec the crash-point is +3.14 dBm0; it has become quite customary to consider the crash-point for a telephony codec to be +3 dBm0 and ignore the minor differences between the two formats for purposes of calculating power and related quantities.
[0017] When dealing with noise powers, values such as -70 dBm0 are common. For convenience a different (though obviously related) unit has become prevalent. In particular, a reference noise level of -90 dBm0 has been chosen (-90 dBm is equivalent to 1 picowatt or 10^-12 watt). Thus a -70 dBm0 noise level can be referred to as +20 dBrn0. For evaluating the subjective impact of noise, spectrally weighted power measures are commonplace. In North America the spectral weighting is called "C-message weighting" and identified by including the letter "C" in the unit. For example, typical idle-channel noise associated with a "clean" channel is 17 dBrnC0. For the interested reader, Ref. [3] provides a clear and concise description of these concepts.
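By way of a nonlimiting illustration, the dBm0 and dBrn0 relationships described above can be sketched in Python. The sketch assumes linear samples, takes the μ-law crash-point of +3.17 dBm0 as the power of a sine-wave whose peak equals the full-scale value, and uses a full-scale value of 1.0 purely for illustration. The function names `power_dbm0` and `dbm0_to_dbrn0` are hypothetical and do not come from any referenced publication.

```python
import math

MU_LAW_CRASH_POINT_DBM0 = 3.17   # power of a full-scale sine-wave (mu-law), per the text
REFERENCE_NOISE_DBM0 = -90.0     # 0 dBrn0 is defined as -90 dBm0


def power_dbm0(samples, full_scale, crash_point_dbm0=MU_LAW_CRASH_POINT_DBM0):
    """Estimate the power, in dBm0, of a block of linear samples.

    A sine-wave of peak amplitude `full_scale` has mean-square value
    full_scale**2 / 2 and is taken to sit exactly at the crash-point.
    """
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    full_scale_sine_mean_square = full_scale ** 2 / 2.0
    return crash_point_dbm0 + 10.0 * math.log10(mean_square / full_scale_sine_mean_square)


def dbm0_to_dbrn0(level_dbm0):
    """Convert a level in dBm0 to dBrn0 (decibels above reference noise)."""
    return level_dbm0 - REFERENCE_NOISE_DBM0


# A 1 kHz tone sampled at 8 kHz repeats every 8 samples.
FULL_SCALE = 1.0  # hypothetical full-scale amplitude
tone = [FULL_SCALE * math.sin(2.0 * math.pi * n / 8.0) for n in range(8)]
tone_level = power_dbm0(tone, FULL_SCALE)   # full-scale tone sits at the crash-point
noise_level = dbm0_to_dbrn0(-70.0)          # the -70 dBm0 example from the text
```

In this sketch a tone at half the full-scale amplitude would read about 6 dB below the crash-point, and the -70 dBm0 noise example maps to +20 dBrn0 as stated in the text.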
[0018] The round-trip delay experienced by a voice channel could be quite significant. In these situations it is advisable (often mandatory) to have some means for controlling echo (see Ref. [3]). Furthermore, for better communication capability it is advisable to use echo cancellers rather than echo suppressors. Considering that both IADs, at either end of the point-to-point link will include echo canceling, it suffices that the echo cancellers be of the "split" variety, i.e., each side removes the echo introduced in its local 2-wire-to-4-wire converter ("hybrid"). Details of echo cancellers and echo canceling algorithms are widely available in the literature, including Ref. [3].
SUMMARY OF THE INVENTION
[0019] There is a need for the following embodiments of the invention. Of course, the invention is not limited to these embodiments.
[0020] According to an embodiment of the invention, a process comprises: operating an integrated echo canceller and speech codec for voice-over internet protocol. According to another embodiment of the invention, a machine comprises: an echo canceller and a speech codec, wherein the speech codec includes a decoder and an encoder, and wherein the echo canceller and the speech codec are integrated for voice-over-internet protocol.
[0021] These, and other, embodiments of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following description, while indicating various embodiments of the invention and numerous specific details thereof, is given for the purpose of illustration and does not imply limitation. Many substitutions, modifications, additions and/or rearrangements may be made within the scope of an embodiment of the invention without departing from the spirit thereof, and embodiments of the invention include all such substitutions, modifications, additions and/or rearrangements.
BRIEF DESCRIPTION OF THE DRAWINGS
[0022] The drawings accompanying and forming part of this specification are included to depict certain embodiments of the invention. A clearer concept of embodiments of the invention, and of components combinable with embodiments of the invention, and operation of systems provided with embodiments of the invention, will be readily apparent by referring to the exemplary, and therefore nonlimiting, embodiments illustrated in the drawings (wherein identical reference numerals (if they occur in more than one view) designate the same elements). Embodiments of the invention may be better understood by reference to one or more of these drawings in combination with the following description presented herein. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale.
[0023] FIG. 1 is a functional block view of a simplified model of a voice-band connection over an IP Network.
[0024] FIG. 2 is a functional block view of transmission of voice-band signals over a packet network.
[0025] FIG. 3 is a functional block view of essentials of an FXS IAD Interface.
[0026] FIG. 4 is a functional block view of essentials of an FXO IAD Interface.
[0027] FIG. 5 is a functional block view of key elements of VF signal processing.
[0028] FIG. 6 is a functional block view of essential signal processing components in an echo canceller.
[0029] FIG. 7 is a functional block view of essential components of the non-linear processor.
[0030] FIG. 8 is a functional block view of qualitative variation of comfort noise power and threshold level with degree of double-talk.
[0031] FIG. 9 is a functional block view of basic hardware blocks required to implement an integrated echo canceller and speech codec for voice-over-IP (VoIP).
[0032] FIG. 10 is a functional block view of key functions in signal processing for VoIP.
[0033] FIG. 11 is a functional block view of functions associated with echo cancellation and compression.
[0034] FIG. 12 is a functional block view of NLP based on degree of double-talk and level of compression.
DESCRIPTION OF PREFERRED EMBODIMENTS
[0035] Embodiments of the invention and the various features and advantageous details thereof are explained more fully with reference to the nonlimiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well known starting materials, processing techniques, components and equipment are omitted so as not to unnecessarily obscure the embodiments of the invention in detail. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only and not by way of limitation. Various substitutions, modifications, additions and/or rearrangements within the spirit and/or scope of the underlying inventive concept will become apparent to those skilled in the art from this disclosure.
[0036] Within this application several publications are referenced by Arabic numerals, or principal author's name followed by year of publication, within parentheses or brackets. Full citations for these, and other, publications may be found at the end of the specification immediately preceding the claims after the section heading References. The disclosures of all these publications in their entireties are hereby expressly incorporated by reference herein for the purpose of indicating the background of embodiments of the invention and illustrating the state of the art.
[0037] The standard for digitizing voice uses G.711 coding (μ-law or A-law) which uses 8 bits per speech sample at a sampling rate of 8 kHz. The net bit rate is thus 64 kbps (a DS0). For efficient transmission, it is conventional to use "speech compression" algorithms that reduce the necessary bit rate from 64 kbps to a lesser value. The G.726 (Ref. [2d]) and G.727 (Ref. [2e]) standards describe two such compression (and decompression or expansion) schemes. In particular, with these "waveform encoding" schemes it is possible to reduce the word-length per sample from 8 bits to 5 or 4 or 3 or 2, corresponding to bit rates of 40, 32, 24, or 16 kbps, respectively. Other encoding schemes achieve compression by encoding collections of samples (often referred to as "analysis frames", "analysis blocks", or just "frames" or "blocks") and can achieve greater compression, corresponding to "bits per sample" of less than 1. The G.728 (Ref. [2f]) and G.729 (Ref. [2g]) algorithms are examples of these high compression algorithms.
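The bit-rate arithmetic in the preceding paragraph can be checked with a short Python sketch. It assumes only the 8 kHz sampling rate stated in the text; the function name `bit_rate_kbps` is hypothetical.

```python
SAMPLING_RATE_HZ = 8000  # telephony sampling rate stated in the text


def bit_rate_kbps(bits_per_sample, sampling_rate_hz=SAMPLING_RATE_HZ):
    """Net bit rate, in kbps, of a sample-by-sample (waveform) encoder."""
    return bits_per_sample * sampling_rate_hz / 1000.0


# G.711 uses 8 bits per sample; the waveform coders use 5, 4, 3, or 2.
g711_rate = bit_rate_kbps(8)
waveform_rates = [bit_rate_kbps(bits) for bits in (5, 4, 3, 2)]
```

Under these assumptions `g711_rate` is the 64 kbps DS0 rate and `waveform_rates` reproduces the 40, 32, 24, and 16 kbps figures.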
[0038] The block-encoding schemes like those of G.729 are more complex and require significantly more processing power to implement than the waveform encoding schemes such as described in G.727. The block-encoding schemes are also less tolerant of transmission errors, introduce more delay, and are not well suited for variable bit-rate encoding. However, algorithms such as G.729 can support digital speech interpolation (DSI). The approach here is to detect "silence" using a voice activity detector ("VAD" is also voice activity detection). Silence frames are encoded with reduced precision (number of bits per block) or even just not transmitted. That is, on the average we achieve variable bit-rate transmission.
[0039] A significant advantage of algorithms such as those described in G.727 and G.729 is that certain echo cancellation functions and certain voice compression functions can be advantageously implemented in a collaborative manner.
[0040] The essential elements of the VF signal processing block for one channel are shown in FIG. 5. The signaling bits (if provisioned) play a role in determining whether the channel is idle or in-use but have been omitted for clarity. For multiple channels, the same processing block is replicated, either physically or logically. The arrangement shown is typical of the prior art whereby the speech-compression and echo cancellation are done in separate modules.
[0041] Note that the "speech compression" block includes the functionality of compression, whereby the data rate associated with the (uncompressed) speech signal is reduced. In addition, it includes as well the functionality of decompression whereby from reduced data rate stream the uncompressed signal suitable for delivery to the VF interface is produced.
[0042] The essential elements of the signal processing functions in the echo canceller are diagrammed in FIG. 6. Normally (as in the prior art) the echo canceller function and the voice compression function are done independently. However, it is shown here as part of the overall signal processing scheme that it is advantageous to link the two functions since there are certain features that can be merged and implemented more efficiently when the two functions are linked together than when they are done separately. To better explain this collaboration, brief descriptions of the two functions are provided, starting with the echo canceller. The operation of echo cancellers is described in detail in Ref. [3]. Just the highlights are given here to show how the compression and cancellation blocks can collaborate in an advantageous manner.
[0043] As depicted in FIG. 6, the "receive-in" signal, {x(n)}, is representative of the signal coming from the far-end and we assume that the (de)compression function has expanded the signal representation from compressed form to a "linear" format. As shown in Ref. [3], it suffices to use a 13-bit (or 14-bit) word-length for adequate fidelity. In practice, the echo canceller will be implemented using a word-length of 16 bits or greater for the arithmetic unit. For delivery to the VF interface that expects G.711 (octet format) for the signal samples, a conversion from linear-to-G.711 is required but has been omitted from FIG. 6 for clarity.
[0044] The signal {x(n)} is converted into an analog signal (in the case of FXO and FXS this conversion is done in the IAD VF interface), for delivery to the end-point station set (telephone), in the codec. A portion of this signal is returned, in conjunction with the locally generated speech signal, because the 2-wire-to-4-wire conversion may not be ideal. This returned signal is the "echo" signal. The intent of the echo canceller is to prevent, to the extent possible, the echo component of the signal from returning to the far end. That is, it attempts to make the transmit-out signal, {v(n)}, destined to the far-end to be as "echo-free" as possible. The transmit-in signal, {y(n)}, consists of a combination of the local speech signal, {s(n)}, and the echo signal, {e1(n)}; ideally, {v(n)}={s(n)}. This is achieved, to a large extent, by using an adaptive FIR (finite impulse response) filter that uses {x(n)} as input and generates an echo-replica, {e(n)} that is subtracted from {y(n)}. The Non-Linear-Processor (NLP) function serves to clean-up the signal to remove, if possible, the remnants of the echo signal that is not completely removed by the subtraction. Considering that the adaptive filter is supposed to model the echo path and that this modeling is rarely, if at all, perfect, the subtractive process cannot remove the entire echo and hence the need for the NLP.
[0045] The particulars of the adaptive FIR filter are described in Ref. [3] in great detail. Here we provide just the key aspects. The filter block contains memory for storing up to N samples of past inputs, i.e., {x(n), x(n-1), x(n-2), . . . , x(n-N+1)} where N is the length of the FIR filter (the maximum expected length) and is representative of the "tail delay" for which the echo canceller is supposed to operate. Usually, tail delay is specified in terms of time in milliseconds (ms) and typical values for tail delay are 16, 32, 48, 64, 96, and 128 ms, corresponding to N=128, 256, 384, 512, 768, and 1024 (samples), respectively. When the 2-wire-4-wire converter (the point where the echo is caused) is close to the echo canceller, smaller values of N can be used. In the case where the 2-wire-4-wire converter is in the IAD itself (such as for the FXS and FXO interfaces), a value of N=16 is more than adequate.
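As a nonlimiting illustration, the correspondence between tail delay and filter length follows directly from the 8 kHz sampling rate (8 samples per millisecond). The Python sketch below uses a hypothetical helper name, `taps_for_tail_delay`.

```python
def taps_for_tail_delay(tail_delay_ms, sampling_rate_hz=8000):
    """Number of FIR taps N needed to span a given tail delay."""
    return tail_delay_ms * sampling_rate_hz // 1000


typical_lengths = [taps_for_tail_delay(ms) for ms in (16, 32, 48, 64, 96, 128)]
short_tail_ms = 16 * 1000 / 8000  # N=16 taps spans 2 ms at 8 kHz
```

By the same arithmetic, the N=16 filter mentioned for the FXS and FXO cases covers a tail of 2 ms.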
[0046] The FIR filter, at any point in time, can be described by its N coefficients, {h(k); k=0, 1, . . . , (N-1)}. We will use superscripts where necessary to indicate the time-varying nature of these coefficients. The operation of the adaptive filter can be described by the following equations:
$$
\begin{aligned}
e(n) &= \sum_{k=0}^{N-1} h^{(m)}(k)\, x(n-k) && \text{filter output producing echo replica} \\
w(n) &= y(n) - e(n) && \text{echo replica subtract} \\
\Delta h^{(m)}(k) &= \mu \left\langle e(n)\, x(n-k) \right\rangle && \text{coeff. increment; } k = 0, 1, \ldots, (N-1) \\
h^{(m+1)}(k) &= h^{(m)}(k) + \Delta h^{(m)}(k) && \text{coeff. update; } k = 0, 1, \ldots, (N-1)
\end{aligned}
\qquad \text{(Eq. 1.1)}
$$
[0047] The brackets used in the coefficient increment calculation represent a time-average. The quantity μ is the adaptation gain. In the traditional LMS (least-mean-square) algorithm postulated for adaptive filters, the coefficient update is done in every sample interval and the "average" implied in the coefficient increment is moot, being over one term. It is shown in Ref. [3] that averaging over multiple terms has some significant benefits. The choice of μ is also important and involves a trade-off. If μ is too large, instability is a problem. A large value of μ (but small enough to retain stability) will provide a more rapid convergence but the final, converged, state may not be very good. In contrast, a small value for μ may have slow convergence, but will have a better final, converged, state. In traditional echo canceller applications the echo path could change from call to call and thus a rapid convergence time is important and the designer attempts to use as large a value of adaptation gain as is feasible and techniques have been described to make the gain variable; large to begin with and small after "convergence" has been achieved. In the case of an IAD, the echo path is static. In fact, in the case of FXS and FXO interfaces it changes with installation, but is reasonably constant. In these situations, convergence time is less of an issue and the adaptation gain can be kept small, rendering instability moot.
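By way of illustration only, the filtering, subtraction, and coefficient-update steps can be sketched in Python. The sketch makes two assumptions that are not taken from the specification: it follows the conventional textbook LMS form, in which the coefficient increment correlates the residual w(n) with x(n-k), and it updates in every sample interval so that the time-average is over a single term. The function name `lms_echo_canceller`, the two-tap echo path, and the test signal are hypothetical.

```python
import random


def lms_echo_canceller(x, y, num_taps, mu):
    """Sample-by-sample LMS echo canceller (illustrative sketch).

    x: receive-in samples, y: transmit-in samples.
    Returns (w, h): the residual after subtraction and the final coefficients.
    """
    h = [0.0] * num_taps          # coefficients h(k), k = 0 .. N-1
    history = [0.0] * num_taps    # history[k] holds x(n-k)
    w = []
    for x_n, y_n in zip(x, y):
        history = [x_n] + history[:-1]
        # Filter output: echo replica e(n) = sum over k of h(k) * x(n-k).
        e_n = sum(h_k * x_k for h_k, x_k in zip(h, history))
        # Echo replica subtract: w(n) = y(n) - e(n).
        w_n = y_n - e_n
        # ASSUMPTION: textbook LMS increment mu * w(n) * x(n-k), one-term average.
        h = [h_k + mu * w_n * x_k for h_k, x_k in zip(h, history)]
        w.append(w_n)
    return w, h


# Single-talk example: no local speech, echo produced by a hypothetical 2-tap path.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.5, -0.25]
y = [sum(echo_path[k] * x[n - k] for k in range(len(echo_path)) if n - k >= 0)
     for n in range(len(x))]
w, h = lms_echo_canceller(x, y, num_taps=4, mu=0.05)
```

With a static echo path and no local speech, as in this example, the coefficients settle toward the echo path and the residual decays; a smaller `mu` would converge more slowly, in line with the trade-off described above.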
[0048] The measure of efficacy of an echo canceller is usually described in terms of ERL and ERLE. ERL, or echo return loss, is a measure of the strength of the echo, defined relative to the strength of the receive-in signal, which is, after all, the source of the echo power. ERLE, or echo return loss enhancement, is a measure of how much the echo power has been reduced by the echo canceller. Both are described using logarithmic (decibel) units. That is (assuming that the receive-in level is substantial),
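$$
\begin{aligned}
\mathrm{ERL} &= -10.0 \log_{10}\!\left( \frac{\sigma_e^2}{\sigma_x^2} \right) \ \text{dB} \\
\mathrm{ERLE} &= \left( \mathrm{ERL}_{\text{post-echo-canc.}} - \mathrm{ERL}_{\text{pre-echo-canc.}} \right) \ \text{dB}
\end{aligned}
\qquad \text{(Eq. 1.2)}
$$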
ERLmin refers to the minimum echo return loss provided by the 2-wire-4-wire (analog) conversion itself. The echo canceller is thus guaranteed that the echo level will be at least that many dB below the receive-in signal level. Common values are 0 dB and 6 dB for network echo cancellers. In the case of IADs, especially with FXS or FXO interfaces, there is substantially more control over the impedances involved and the minimum ERL (i.e., ERLmin) can be made as high as 12 dB.
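As a worked illustration of Eq. 1.2, the following sketch computes ERL and ERLE from signal powers. The function names are hypothetical, and the power estimate is assumed to be a simple mean square over a block of samples.

```python
import math

def power(samples):
    """Mean-square power of a block of samples."""
    return sum(s * s for s in samples) / len(samples)

def erl_db(echo_power, receive_power):
    """Echo return loss in dB: echo power relative to receive-in power."""
    return -10.0 * math.log10(echo_power / receive_power)

def erle_db(erl_post_db, erl_pre_db):
    """Enhancement in dB: ERL after the canceller minus ERL before it."""
    return erl_post_db - erl_pre_db
```

For example, an echo at one quarter of the receive-in power gives an ERL of about 6 dB. If the residual echo after cancellation is at one thousandth of the receive-in power (30 dB), the ERLE is about 24 dB.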
[0049] In order to describe the operation of the Non-Linear Processor (NLP), we first need to describe the notions of "single-talk", "soft-double-talk", "hard-double-talk" and so on. Single-talk represents the situation wherein the receive-in signal strength is substantial but the near-end speaker is silent. That is, {x(n)} is substantial (i.e., there is incoming speech power from the far end), but the (local) signal {s(n)} is weak, essentially silence. In this situation the transmit-in signal, {y(n)}, consists primarily of the echo signal. Hard-double-talk is the situation wherein the local speech signal is substantially more powerful than the receive-in speech signal. In this situation, the transmit-in signal consists primarily of the local speech signal. Soft-double-talk represents the situation in between, wherein neither side is silent. Double-talk conditions determine the operation of the NLP as well as adaptation. The filter is allowed to adapt "freely" in single-talk; adaptation is frozen in hard-double-talk situations, usually by making the adaptation gain zero; adaptation is slowed down in soft-double-talk situations, by making the adaptation gain much less than it would be in the case of single-talk. For convenience we shall define two new terms. Single-talk-receive is the situation wherein the receive-in signal is active (substantial power) but the local speaker is silent (the power of {s(n)} is negligible); this is equivalent to what was referred to as single-talk just earlier. Single-talk-transmit is the situation where the local speaker is active, but the power of the receive-in signal is negligible; this situation was one facet of hard-double-talk mentioned just earlier. These definitions will help explain the operation of the Non-Linear Processor.
[0050] The function of the Non-Linear Processor is to remove the last vestiges of echo. In other words, the NLP increases the ERLE from that provided by the adaptive filter to a value that is very large, technically infinite. One of the key principles underlying the NLP is that the human auditory system is very sensitive to what is considered "intelligible", or recognizable, signals but is tolerant of high levels of unintelligible, or noise, signals. That is, the remaining echo signal (which is intelligible) may be substituted with a noise signal of higher power but yet the channel is perceived as sounding better! A second principle of the NLP is that the signal power should never be too small or the line will appear "dead", a very major cause for reduction in end-user Quality of Experience (QoE).
[0051] These two principles are embodied in the NLP, the general functionality of which is depicted in FIG. 7.
[0052] The controllable parameters of an NLP as depicted in FIG. 7 are "T", the threshold of the non-linearity, and the power of the additive comfort noise, PC (the manner of generating a noise signal of specified power is not elaborated in FIG. 7). The function of the NLP, once the controllable parameters are specified (they are not static but are changed based on the state of the echo canceller as determined by the control mechanism), is straightforward. The threshold device "squelches" the signal by forcing all signal samples of magnitude less than T to a zero value; signal samples of magnitude greater than T are passed unchanged. The comfort noise generator produces a random signal, preferably "white noise", of power PC that is added to the output of the threshold device to obtain the transmit-out signal, {v(n)}. In the case of network echo cancellers, the number representation, or format, must be altered from the uniform (or linear PCM) format used for arithmetic operations to the companded format (μ-law or A-law; G.711) required for transmission. In the case of an IAD, the subsequent processing, namely the compression step, can utilize the signal in uniform format directly.
[0053] It is customary to express sample values using a normalized scale where 1.0 corresponds to the maximum value represented (on a linear scale). In Telephony, this maximum value is expressed in (normalized) logarithmic (dBm or decibels relative to 1 milliwatt) units as +3 dBm0 (see G.711 and G.712). Thus if T=1.0 (or +3 dBm0), the signal {w1(n)} is completely squelched; if T=0.0 then w1(n)=w(n) and the threshold device has no impact on the signal.
[0054] The comfort noise power, PC, is intended to be equivalent to the background noise power of the transmit signal, that is, the signal power measured or estimated when both sides are silent. When the threshold device squelches the transmit signal (using a large value for T, for example), the signal {w1(n)} corresponds to zero value samples, a condition that sounds eerily silent. This level of silence is actually disturbing to the listener at the far end, who may believe the connection is "dead". Adding in the "comfort noise" makes the connection seem "normal". By making the comfort noise power equal, to the extent possible, to the background noise power, the periods of squelching sound comfortably normal. In fact, using white noise, characterized by a flat power spectral density, for the comfort noise signal makes these interludes actually very pleasing to the human ear.
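The threshold device and comfort noise addition of the conventional NLP can be sketched as follows. This is a minimal illustration only: the function name is hypothetical, samples are assumed to be on the normalized scale where 1.0 is the maximum magnitude, and zero-mean Gaussian noise is assumed as the white noise source (the text does not specify how the noise is generated).

```python
import math
import random

def nlp(w, threshold, comfort_power, rng=None):
    """Squelch samples at or below the threshold, then add comfort noise.

    w: residual samples on a normalized scale (|sample| <= 1.0).
    threshold: T in [0.0, 1.0]; T=1.0 squelches everything, T=0.0 nothing.
    comfort_power: PC, the variance of the added noise.
    """
    rng = rng or random.Random(0)
    sigma = math.sqrt(comfort_power)
    out = []
    for sample in w:
        # Threshold device: pass only samples of magnitude greater than T.
        w1 = sample if abs(sample) > threshold else 0.0
        # Comfort noise generator (Gaussian white noise is an assumption).
        noise = rng.gauss(0.0, sigma) if sigma > 0.0 else 0.0
        out.append(w1 + noise)
    return out
```

With `threshold=1.0` and a non-zero `comfort_power` the output is pure comfort noise, which corresponds to the squelched condition described above.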
[0055] We shall denote the situation wherein the transmit-out signal is "quiet", either background noise or comfort noise, as transmit-silence; a similar situation for the receive-in path (which is the transmit-out at the distant end) can be referred to as receive-silence. These terms will be helpful when we discuss the combined operation of echo canceller and voice compression.
[0056] The control of the Non-Linear Processor is based on reasonably complex algorithms that have traditionally been kept as closely guarded trade secrets by echo canceller manufacturers. A simplified version of a representative method is described here. However, it will be apparent later that, by considering a joint control mechanism between the echo canceller and the voice compression block, a new methodology is proposed that has better performance characteristics than that of prior art and some significant advantages from the viewpoint of implementation.
[0057] The key to NLP control is the estimation of the "degree" of double-talk that we denote by the quantity ρ. In a single-talk-receive situation, ρ=0. In a single-talk-transmit situation, ρ=1. A simplistic view of the degree of double-talk is the ratio of the power of {s(n)} [or σs2], the power of the local talker, to the power of {y(n)} [or σy2], the total power of the transmit-in signal (the combination of local talker plus echo). Denote by PC.sup.(nom) the average power of the background noise. This would be the nominal value of the comfort noise power, whereas the value used for the comfort noise power at any time is denoted by PC. The chosen values for PC and the threshold value, T, at any time are based on the degree of double-talk, ρ, and can be described (qualitatively) by the graph in FIG. 8.
[0058] Of importance in FIG. 8 is the observation that when the transmit-in signal is primarily echo, as would be the case in a single-talk-receive situation, the degree of double-talk corresponds to ρ=0, the threshold is at its maximum value (T=1) implying that the transmit signal is completely squelched, and the comfort noise power is at its nominal (maximum) value of the ambient background noise. The net transmit condition is one of transmit-silence, as defined before. Similarly, when the transmit-in signal is primarily local talker, as would be the case in a single-talk-transmit situation, the degree of double-talk corresponds to ρ=1, the threshold is at its minimum value (T=0) implying that the transmit signal is unchanged by the threshold device, and the comfort noise power is essentially null (no comfort noise is added). Between these two extremes, the comfort noise power and threshold value change monotonically with respect to the degree of double-talk. Further enhancements to perceived voice quality can be obtained by introducing "hang-over" intervals for transitions between single-talk-receive and single-talk-transmit situations (for example, see Ref. [3] and [4]).
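One possible realization of the qualitative behavior of FIG. 8 is sketched below. The text requires only that T and PC vary monotonically between the two extremes; the linear interpolation used here, the function names, and the handling of a zero-power transmit-in signal are all assumptions made for illustration.

```python
def double_talk_degree(local_power, transmit_in_power):
    """Simplistic degree of double-talk: local talker power over total power."""
    if transmit_in_power <= 0.0:
        return 0.0  # assumption: treat a zero-power frame like single-talk-receive
    return min(1.0, max(0.0, local_power / transmit_in_power))

def nlp_parameters(rho, nominal_noise_power):
    """Map the degree of double-talk to (T, PC).

    rho=0 gives T=1 (full squelch) and PC at its nominal value;
    rho=1 gives T=0 (signal unchanged) and no comfort noise.
    The linear variation in between is an assumption.
    """
    rho = min(1.0, max(0.0, rho))
    threshold = 1.0 - rho
    comfort_power = (1.0 - rho) * nominal_noise_power
    return threshold, comfort_power
```

A practical controller would also apply the hang-over intervals mentioned above before switching between the extremes; that is omitted here.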
[0059] Network echo cancellers have to provide functionality whereby NLP on/off and canceller on/off are controlled by in-band signals (see G.165 and G.168). This is to accommodate situations where the channel is being used by voice-band modems (e.g. V.34, V.90, etc.) and the traffic signal is not voice but encoded data. In the case of an IAD, specifically an IAD designed for providing point-to-point voice communication links, such extensions are not required. That is, private-line IADs require echo canceller functionality that is a subset of network echo canceller functionality. Furthermore, network echo cancellers may be placed in the transmission path at a significant distance, transmission-wise, from the 2-wire-to-4-wire converter (the point of echo origin) and thus have to provide "tail delay" values upwards of 32 ms. In the case of a private-line IAD, especially when the 2-wire-to-4-wire conversion is done in the IAD, tail delay values of 16 ms or less can suffice. Considering that the performance of echo removal varies inversely as filter length, keeping the filter length to the minimum possible is advantageous.
[0060] There are several methods proposed and standardized for speech compression. The general intent is to lower the bit-rate required for transmission from the nominal 64 kbps (a DS0) to something less. Moderate compression ratios, bringing the rate down to between 40 kbps and 16 kbps, can be achieved using "wave-form encoding". In waveform encoding, the sampling rate is maintained at 8 kHz (8000 samples per second) but the word-length used to represent each sample is reduced from 8-bits/sample to 5-, 4-, 3-, or 2-bits per sample, corresponding to bit-rates of 40, 32, 24, and 16 kbps, respectively. ITU-T recommendations G.726 and G.727 describe such compression methods--G.726 applies to schemes where the compression ratio is chosen and then fixed; G.727 allows for changing the compression ratio "on the fly" (i.e., during operation).
[0061] Greater compression ratios can be achieved by using "block-encoding" methods such as CELP (Code Excited Linear Prediction) (see Ref [3] or G.728 or G.729) where the speech is treated in blocks or analysis-frames (or just "frames") of time, often 10 to 20 ms long. The entire segment (or frame) of speech is then processed to extract certain parameters. These parameters, suitably encoded, are transmitted for the frame. The decoder, using these parameters, generates a synthesized speech signal for that segment of time. The number of bits required to encode these parameters can be quite small and thus a higher compression ratio can be achieved than with waveform methods. Typical bit-rates using block-encoding techniques range from 16 kbps to 8 kbps.
[0062] The overall compression ratio can be increased, in both waveform and block-encoding methods, if advantage can be taken of periods of silence. This technique, referred to as DSI (digital speech interpolation), allocates near-zero bandwidth for the channel during periods of silence; the receiver recognizes this and synthesizes a comfort noise signal in its place. Clearly, in circuit-switched situations where a fixed amount of bandwidth is reserved for the channel, silent or otherwise, DSI is moot. In situations where the bandwidth is shared between multiple bearers, DSI can be very useful (see Ref. [3] and [4,5,6] for an example where DSI is used).
[0063] The G.726/G.727 and G.729 signal processing algorithms are described in detail in the corresponding ITU-T standards. A simplified discussion is provided here to show how certain functions of the compression algorithm can be amalgamated with other functions such as the echo canceller and the multiplexing and demultiplexing into and from ATM cells or IP packets.
[0064] The key to compression (also called "coding" in some industry literature) is to reduce the number of bits per sample used to represent the signal. This reduction must be done in such a way that a reasonable replica of the original (uncompressed) signal can be reproduced at the receiver.
[0065] The compression and decompression algorithms are naturally operated on a "per-sample" basis. However, it is simple, and is the method of choice in most implementations that use Digital Signal Processor ("DSP") chips, to treat samples in "blocks" or "frames". That is, contiguous samples corresponding to m ms (8m samples, considering a sampling rate of 8 kHz) are processed as a batch. This implies buffering of the signals using buffers that are at least 8m samples in depth. Certain functions of the echo canceller, such as the generation of echo replica, replica subtraction, non-linear modification and addition of comfort noise, can be implemented either on a per-sample basis or in frames. Other functions in an echo canceller such as determination of adaptive filter coefficient updates, double-talk condition, threshold value, are naturally and advantageously done in a once-every-m-msec basis. Some functions such as determination of silence are also advantageously done on segments (frames/blocks) of samples.
[0066] The choice of frame length is somewhat arbitrary. Having frame sizes that are small permits the rapid reaction to changing situations. Large frame sizes are more amenable for efficient implementation. We find that a frame size of 10 ms is a good compromise. That is, m=10, and the frame size is 80 samples. The adjustment of the method for other frame sizes is straightforward but we shall restrict our discussion to the case of 80-sample frames for convenience.
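The arithmetic relating frame length, sample count and bit-rate can be checked with a short sketch. The helper names are hypothetical; the 8 kHz sampling rate and the 2- to 5-bit word lengths are those given in the text.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate stated in the text

def samples_per_frame(frame_ms):
    """Number of samples in a frame of frame_ms milliseconds (8m samples)."""
    return SAMPLE_RATE_HZ * frame_ms // 1000

def bit_rate_kbps(bits_per_sample):
    """Bit-rate of a waveform encoder using the given word length."""
    return SAMPLE_RATE_HZ * bits_per_sample / 1000

def payload_bits(frame_ms, bits_per_sample):
    """Bits needed to carry the encoded samples of one frame."""
    return samples_per_frame(frame_ms) * bits_per_sample

def split_into_frames(samples, frame_ms=10):
    """Group samples into whole frames; a trailing partial frame is dropped."""
    size = samples_per_frame(frame_ms)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]
```

For the 10 ms frame used in this discussion, each frame holds 80 samples, and a frame encoded at 2 bits per sample (16 kbps) carries 160 bits of speech payload.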
[0067] Associated with each voice channel, there are several items of interest that must be suitably encapsulated and transmitted to the far end. These are:
[0068] Signaling bits. Associated with each channel are 2 bits, the "A" and "B" signaling bits, which indicate the state of the channel. These are conditioned for transmission on a per-frame basis in TDM scenarios. For VoIP the transmission is done "as necessary", typically when there is a change of state.
[0069] Speech Samples. There are 80 speech samples for each 10-ms frame. These may be compressed. The manner of encapsulating these samples is discussed later.
[0070] Encoding level. Whereas the encoding process in wave-form encoders (G.727) generates 5 bits per sample, for purposes of transmission fewer bits are used. In G.729, the number of bits is not a "per-sample" entity as in G.727 but represents the overall information content of the 10 ms block. Furthermore, for "silence", no speech samples need be transmitted, corresponding to 0 bits-per-sample in the case of wave-form encoding and 0 bits-per-block for G.729 (block-encoding methods). The manner of encapsulating these samples is discussed later.
[0071] Silence Status. A channel can be deemed "silent" if the signaling state associated with the channel corresponds to "on-hook" or "inactive". For an "off-hook" channel (for which there is an active call), silence is associated with a low power level (associated with background noise) or a state wherein the speech signal can be replaced by comfort noise, such as when the NLP of the echo canceller is squelching the signal. If the silence status is asserted, then it is not necessary to send information regarding the speech samples, since the far-end can synthesize the comfort noise. It is convenient to send a silence-status indicator independent of the signaling state. Whereas 1 bit of information is adequate to decide between silent/not-silent, it is convenient to imply this state by specifying the number of bits per sample used for transmitting the compressed "speech" samples (0 bits per sample is equivalent to silence).
[0072] Comfort Noise Level. On the one hand, the information regarding the comfort noise power level can be quite precise, providing a fine gradation in comfort noise power levels. On the other hand, since information regarding comfort noise is being transmitted quite often, every 10 ms in the case being discussed, and is also being transmitted during periods where the signal is not silent, an "incremental" approach is possible. Such an approach is described in Ref. [3] and in Ref. [4]. In fact, it suffices to transmit 2 bits of information in each frame in order to have a good enough characterization of the noise level in the far-end receiver for effective synthesis of comfort noise.
[0073] A problem with the prior art, wherein the echo canceller function and the voice compression functions are separate and independent, is immediately obvious. Specifically, consider the situation where the local talker is silent and the signal considered for transmission is primarily echo. The echo canceller function will first attempt to reduce the level of this echo via the subtractive process. The NLP function will attempt to squelch the residual echo and insert comfort noise. This comfort noise type and level is based on the estimate made by the echo canceller function. This last observation is crucial.
[0074] The voice compression function will then try to compress this echo-canceller-generated comfort noise. Considering that the compression efficacy is optimized for speech-like signals, the compression action on the comfort noise signal will be sub-optimal in performance. That is, if the voice compression does not recognize the echo-canceller-generated comfort noise for what it is, it may try and encode it. This implies that not only is the encoding sub-optimal in terms of performance, it is unnecessarily utilizing transmission bandwidth by not categorizing it as silence.
[0075] Alternatively, the voice compression function may categorize the echo-canceller-generated noise as "silence" or, at least, as just background noise. In that case, in the process of establishing the parameters for the voice-compression-generated comfort noise, the signal whose parameters the voice-compression block estimates is not the actual background noise of the telephone but, rather, the comfort noise generated by the echo canceller. Consequently, the parameters of the comfort noise generated at the far end by the voice (de-)compression function can be erroneous. Many implementations of VoIP have experienced this problem, and the resultant effect is unnatural sounding performance with detrimental impact on the associated end-user Quality of Experience.
[0076] The implementation of the invention will likely be in the form of software executing on a processor. The conversion to and from analog (the VF Interface) can be implemented as an ancillary unit to the main processor. In FIG. 9 the primary hardware blocks required to implement this invention are depicted at a high level. Analog signal(s) 196 are received and transmitted by an analog interface 191. A clock 192 is coupled to the analog interface 191. Data 194 is communicated between the analog interface 191 and a processor block 193. An interrupt signal is transmitted from the analog interface 191 to the processor block 193.
[0077] The VF Interface (Analog Interface) (191 in FIG. 9) includes the conversion circuitry (ADC and DAC) and operates based on a local clock depicted by 192. The transfer of samples between the analog interface (191) and the main processor block (193 in FIG. 9) is achieved using an interrupt mechanism. This is generally done on a per-sample basis. The processor block includes the interrupt processing for handling these per-sample interrupts and collecting data into blocks (frames) and providing a software interrupt so that the main signal processing (sub)block can operate at the block/frame level.
[0078] The principal intent of the NLP is to provide echo return loss enhancement, even at the cost of signal quality. This is evident from FIG. 7 that shows a conventional NLP and the associated threshold device. The method described here is different. The threshold device is enhanced and the effect of the NLP achieved by controlling the encoding method. In particular, the effect of the NLP during double-talk can be achieved by reducing the bits-per-sample used in the compression. During periods of single-talk the threshold device is deemed to be active and thereby reduces the low-level echo to negligible levels with the assumption that comfort noise will be included to prevent loss of Quality of Experience caused by "eerie silence".
[0079] In order to explain this action, the key functional blocks involved for this discussion are provided in FIG. 10. A speech compression block 212 is coupled to an echo canceller block 211. A jitter buffer and depacketization block 213 is coupled to the speech compression block 212 for receive from the IP network. A buffer and packetization block is coupled to the speech compression block 212 for transmit to the IP network.
[0080] When implemented for VoIP, the signal processing functions must interact with the functions that are peculiar to the IP network. These IP-network-related functions are primarily there for packetization (transmit) and depacketization (receive). One distinction between transporting the information over a packet network rather than a circuit-switched network (TDM) is that the IP network introduces significant delay variation (also known as packet-delay variation or PDV or time-delay variation or TDV) as well as the possibility of packets being lost. In contrast TDM networks have small delay variations and may have bit-errors but no episodes of missing information (unless the link is down). Therefore in an IP environment there is the need for a jitter buffer which is an elastic buffer required to smooth out the delay variations and present information to the signal processing block in a regularly spaced (timely) manner. The depacketization block can also flag missing (i.e. lost) packets and allow the signal processing function to invoke packet loss concealment (PLC) procedures to mitigate the deleterious impact of packet loss on the end-user QoE. The PLC function is generally an integral part of the decompression functionality of the speech-compression block.
[0081] In the transmit direction the signal processing function can flag the presence of silence whereby the packetization function does not transmit packets but maintains the requisite information to flag the far-end that a packet was not transmitted so that a "missing" packet is not interpreted as a "lost" packet.
[0082] When RTP (Real-time Transport Protocol) is employed, the packets also contain a time-stamp that reflects the originating time of the packet. In many implementations this time-stamp is created by the signal processing block, especially when the time-stamp represents the number of octets from the VF interface that are being represented in that particular packet. In general, the time-stamp is representative of elapsed time from a chosen epoch and the frequency reference is generally, though not always, related to the clock of the A/D converter of the VF interface.
[0083] FIG. 11 provides a more detailed view of the functionality of the echo cancellation and compression functions. An echo canceller block 221 is coupled to a compression block 222. The echo canceller block 221 includes a PWR CALC; CNG sub block 2212. An H block 2211 and an NLP block 2213 are coupled to the PWR CALC; CNG sub block 2212. The compression block 222 includes a decoder 2223. A CNG block 2222 is coupled to the decoder 2223. The compression block 222 includes an encoder 2224. A VAD block 2221 and an ALLOC block 2225 are coupled to the encoder 2224.
[0084] In FIG. 11, the sub block labeled "PWR CALC; CNG" (2212 in FIG. 11) in the Echo Canceller Block (221 in FIG. 11) performs several computations related to short-term power of signals. These include:
[0085] A. The power of the received signal, {x(n)}, to the echo canceller block. When the echo canceller and compression block are considered amalgamated, the signal power can be provided by the DECODER function in the case of several voice compression schemes such as G.729. If the decoder has introduced comfort noise to fill in for silence, the power is known and does not need to be recomputed in the Echo Canceller Block.
[0086] B. The power of the transmit-in signal {y(n)}. This is used in the decision making process for deciding whether there is double-talk, whether there is silence, and what is the appropriate background noise power for establishing the comfort noise power (that will be used by the distant end). Amalgamation of echo canceller and compression means that this computation can be shared between the two functions.
[0087] C. The powers of the echo replica, {e(n)}, and result, {w(n)}, of subtracting the echo replica from the transmit-in signal. {e(n)} and {w(n)} are not explicitly indicated in FIG. 11 but are shown in FIG. 6 where the functions of the echo canceller block are shown.
[0088] These short-term power calculations are used in the development of the double-talk parameters as well as in the establishment of comfort noise power. Suitable methods for this are provided in various publications related to echo cancellers including Ref. [3]. The invention described here does not restrict the algorithms used for double-talk detection and classification.
[0089] In the case of G.727 it is a simple matter of designating the "Type" and thus controlling the effective compression level. The implementation of the threshold (equivalent) device is thus changed to one of flow control rather than sample manipulation. The notion of "Type" is related to the coarseness of quantization applied in the compression function. G.727 allows for varying sample-word-size to achieve the varying level of compression. Traditional DS0 (encoded speech) formats use 8 bits per sample and a sampling rate of 8 kHz for a net bit-rate of 64 kbit/s. Compression to 40 kbit/s is achieved by reducing the word-size to 5 bits/sample; 32 kbit/s operation implies 4 bits/sample; 24 kbit/s implies 3 bits/sample; 16 kbit/s implies 2 bits/sample. If the speech is not being sent ("silence") we can consider that to be 0 bits/sample.
[0090] The fewer the bits per sample, the coarser is the quantization. Generally speaking, coarse quantization has the effect of squelching low-level signals and therefore has an effect similar to what we expect in the NLP of an echo canceller.
[0091] In FIG. 12 we have reproduced FIG. 8 that qualitatively described the operation of a conventional NLP with some additions. In the method proposed, the comfort noise is still added in and its power level controlled in much the same way as in a conventional NLP. However, the threshold device is dispensed with and the degree of double-talk is used to ascertain the level of compression. A higher level of compression (lower bit-rate) has a similar effect as a higher threshold (T). In particular, two thresholds, ρ1 and ρ2, are pre-established (the specific values depend on the manner in which the degree of double-talk is estimated). If the degree of double-talk is 1 (corresponding to complete squelching in the case of a conventional NLP threshold device) then we designate the transmission octets as being Type-0 (i.e. 0 bits per sample); at the far end, the decompression method replaces the signal with comfort noise. For double-talk degree between 1 and ρ1, comfort noise of the appropriate power is added and the signal then encoded by the G.727 processing. However, the frame is tagged as being Type-1, corresponding to a 16 kbps (2 bits/sample) compression level that is coarse enough to achieve the requisite echo return loss enhancement. Similarly if the degree of double-talk is between ρ1 and ρ2, the frame is tagged as Type-2; if the degree of double-talk is between ρ2 and 0.0, the frame is tagged as Type-3; for a degree of double-talk value of 0 (or extremely small), the NLP is effectively turned off by making the additive noise power zero and designating the frame as Type-4. It should be noted that the cell multiplexing block may reassign the compression level based on available bandwidth, but will never increase the bits-per-sample from that assigned by the NLP function.
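The frame-tagging rule of this paragraph can be sketched as a simple lookup. The sketch follows the paragraph as worded, in which the value 1 corresponds to complete squelching; since paragraph [0058] associates complete squelching with ρ=0, the input is named `squelch_degree` here to keep the two senses apart. The function names, the tolerance `eps`, the requirement that ρ1 exceed ρ2, and the bits-per-sample values for Type-2 through Type-4 (only Type-0 and Type-1 are stated explicitly) are assumptions.

```python
# Bits per sample for each frame type. Type-0 (0 bits) and Type-1 (2 bits)
# are stated in the text; Types 2 to 4 are assumed to use the remaining
# G.727 word lengths in increasing order.
BITS_PER_SAMPLE = {0: 0, 1: 2, 2: 3, 3: 4, 4: 5}

def frame_type(squelch_degree, rho1, rho2, eps=1e-3):
    """Tag a frame with a compression Type from the degree used in [0091]."""
    if not (0.0 < rho2 < rho1 < 1.0):
        raise ValueError("expected 0 < rho2 < rho1 < 1")
    d = min(1.0, max(0.0, squelch_degree))
    if d >= 1.0 - eps:
        return 0   # complete squelch: far end substitutes comfort noise
    if d >= rho1:
        return 1   # coarsest quantization, 16 kbps
    if d >= rho2:
        return 2
    if d > eps:
        return 3
    return 4       # NLP effectively off, no additive noise

def reassign_for_bandwidth(assigned_type, bandwidth_limit_type):
    """Multiplexer may lower the Type but never raises bits per sample."""
    return min(assigned_type, bandwidth_limit_type)
```

The second helper reflects the constraint that the multiplexing block can only move a frame toward fewer bits per sample than the NLP function assigned.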
[0092] When a channel is "silent", the transmission bandwidth allocated to the channel can be reduced to near-zero and made available to other voice or data channels. In some situations, such as when a channel is deemed "inactive" or "on-hook", or when the NLP function determines that the channel is squelched, a decision on whether the channel is silent or not is moot. In the "active" or "off-hook" state, the decision of whether the channel is silent or not is based on the short-term power of the signal.
[0093] The nature of speech is such that the short-term power level is quite variable during "active" intervals. It is commonplace to have short pauses and for the power level to appear "low" even though signal is carrying useful information. To maintain a high (subjective) voice quality it is necessary to account for such low power situations and not classify them as silent. This approach of "erring on the side of caution" provides good subjective speech quality, albeit at the expense of sometimes classifying an interval that is actually silent as active. The methodology is described in detail in Ref. [3] and Ref. [4] and briefly summarized here.
[0094] The short-term power for each frame (10 ms) of samples is computed. The relevant signal is {w(n)}, representing the transmit signal with (most of) the echo removed. The time aspect of the frames is represented using the index m, and thus σw2(m) represents the average power of the signal samples comprising frame m. Based on this power we can classify the frame as potentially-silent or potentially-active, according as the power is, respectively, less than or greater than a threshold, Ts, that is related to the power of the background noise. The next step is to determine whether the frame is silent or active (separate from potentially-silent and potentially-active).
[0095] The following algorithm is appropriate for the current application:
[0096] The frame indexed as m is silent if frame m is potentially-silent and frames (m-1), (m-2), . . . , (m-N) are all potentially-silent. Here N is a parameter that describes the level of hang-over desired. Typically the ending of a speech utterance involves a drop-off in short-term power. By making N large, the tendency to "clip" the ends of speech utterances is avoided. Suitable values for hang-over are between 10 and 40 ms and thus N will be between 1 and 4 (inclusive) (assuming 10 ms frame size).
[0097] Otherwise, frame m is active.
[0098] Note that frame m is silent if frame m is potentially-silent and frame (m-1) is silent. Also, if frame m is potentially-active it is active and it is possible for a frame to be potentially-silent but yet classified as active. It is a simple algorithm and a DSI scheme using such a simple algorithm is often referred to by the acronym RDSI (rudimentary digital speech interpolation). This is described in detail in Ref. [3] and Ref. [4].
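The silence decision with hang-over described in the preceding paragraphs can be sketched as follows. The function names are hypothetical. Two details not fixed by the text are assumed: frames before the start of the sequence are treated as not potentially-silent, and a frame whose power exactly equals the threshold is treated as potentially-active; both err on the side of caution.

```python
def frame_power(frame):
    """Average power of the samples in one frame."""
    return sum(s * s for s in frame) / len(frame)

def classify_frames(powers, threshold, hangover):
    """Label each frame "silent" or "active".

    powers: short-term power of each frame, in time order.
    threshold: Ts, related to the background noise power.
    hangover: N, the number of preceding frames that must also be
              potentially-silent before frame m is declared silent.
    """
    labels = []
    run = 0  # consecutive potentially-silent frames ending at this frame
    for p in powers:
        run = run + 1 if p < threshold else 0
        # Frame m and frames m-1 ... m-N must all be potentially-silent.
        labels.append("silent" if run > hangover else "active")
    return labels
```

With N=2, a frame is declared silent only from the third consecutive potentially-silent frame onward, and a single potentially-active frame restarts the count, so the trailing edge of an utterance is not clipped.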
[0099] When viewed in the context of the invention described here, the amalgamation of the non-linear processor block and the compression block permits a more efficient way to ascertain whether any transmission bandwidth is required for the frame in question. In addition to efficiency, false detection of silence and false detection of non-silence are both reduced in probability. This is because a prior art compression device will see comfort noise and/or low-level echo that should be considered silence.
[0100] The invention described here does not restrict the method for silence detection, other than requiring that it be suitable for amalgamation with the echo canceller block. The algorithm described above meets this criterion.
[0101] The methods for estimation of comfort noise power and for the generation of comfort noise of appropriate level are well established and are not described in detail here. Such methods, for example those described in Ref. [3] and Refs. [4], [5], [6] and [8], are incorporated by reference. The invention described here does not restrict the algorithms used for comfort noise power estimation or generation.
DEFINITIONS
[0102] The term program and/or the phrase computer program are intended to mean a sequence of instructions designed for execution on a computer system (e.g., a program and/or computer program, may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer or computer system).
[0103] The term substantially is intended to mean largely but not necessarily wholly that which is specified. The term approximately is intended to mean at least close to a given value (e.g., within 10% of). The term generally is intended to mean at least approaching a given state. The term coupled is intended to mean connected, although not necessarily directly, and not necessarily mechanically. The term proximate, as used herein, is intended to mean close, near adjacent and/or coincident; and includes spatial situations where specified functions and/or results (if any) can be carried out and/or achieved. The term distal, as used herein, is intended to mean far, away, spaced apart from and/or non-coincident, and includes spatial situations where specified functions and/or results (if any) can be carried out and/or achieved. The term deploying is intended to mean designing, building, shipping, installing and/or operating.
[0104] The terms first or one, and the phrases at least a first or at least one, are intended to mean the singular or the plural unless it is clear from the intrinsic text of this document that it is meant otherwise. The terms second or another, and the phrases at least a second or at least another, are intended to mean the singular or the plural unless it is clear from the intrinsic text of this document that it is meant otherwise. Unless expressly stated to the contrary in the intrinsic text of this document, the term or is intended to mean an inclusive or and not an exclusive or. Specifically, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present). The terms a and/or an are employed for grammatical style and merely for convenience.
[0105] The term plurality is intended to mean two or more than two. The term any is intended to mean all applicable members of a set or at least a subset of all applicable members of the set. The phrase any integer derivable therein is intended to mean an integer between the corresponding numbers recited in the specification. The phrase any range derivable therein is intended to mean any range within such corresponding numbers. The term means, when followed by the term "for" is intended to mean hardware, firmware and/or software for achieving a result. The term step, when followed by the term "for" is intended to mean a (sub)method, (sub)process and/or (sub)routine for achieving the recited result. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. In case of conflict, the present specification, including definitions, will control.
CONCLUSION
[0106] The described embodiments and examples are illustrative only and not intended to be limiting. Although embodiments of the invention can be implemented separately, embodiments of the invention may be integrated into the system(s) with which they are associated. All the embodiments of the invention disclosed herein can be made and used without undue experimentation in light of the disclosure. Although the best mode of the invention contemplated by the inventor(s) is disclosed, embodiments of the invention are not limited thereto. Embodiments of the invention are not limited by theoretical statements (if any) recited herein. The individual steps of embodiments of the invention need not be performed in the disclosed manner, or combined in the disclosed sequences, but may be performed in any and all manner and/or combined in any and all sequences.
[0107] Various substitutions, modifications, additions and/or rearrangements of the features of embodiments of the invention may be made without deviating from the spirit and/or scope of the underlying inventive concept. All the disclosed elements and features of each disclosed embodiment can be combined with, or substituted for, the disclosed elements and features of every other disclosed embodiment except where such elements or features are mutually exclusive. The spirit and/or scope of the underlying inventive concept as defined by the appended claims and their equivalents cover all such substitutions, modifications, additions and/or rearrangements.
[0108] The appended claims are not to be interpreted as including means-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase(s) "means for" and/or "step for." Subgeneric embodiments of the invention are delineated by the appended independent claims and their equivalents. Specific embodiments of the invention are differentiated by the appended dependent claims and their equivalents.
REFERENCES
[0109] [1] Voice Performance over packet-based networks, by Danny De Vleeschauwer and Jan Janssen, An Alcatel White Paper, available at www.alcatel.com.
[0110] [2] ITU-T Recommendations Series G, Transmission systems and media, digital systems and networks, available from the ITU-T website (www.itu.int).
[0111] [2a] Rec. G.165, Echo cancellers.
[0112] [2b] Rec. G.168, Digital network echo cancellers.
[0113] [2c] Rec. G.711, Pulse code modulation (PCM) of voice frequencies.
[0114] [2d] Rec. G.726, 40, 32, 24, 16 kbit/s Adaptive Differential Pulse Code Modulation (ADPCM).
[0115] [2e] Rec. G.727, 5-, 4-, 3-, 2-bits per Sample Embedded Adaptive Differential Pulse Code Modulation (ADPCM).
[0116] [2f] Rec. G.728, Coding of Speech at 16 kbit/s using low-delay code-excited linear prediction.
[0117] [2g] Rec. G.729, Coding of speech at 8 kbit/s using CS-ACELP.
[0118] [3] Kishan Shenoi, Digital Signal Processing in Telecommunications, Prentice Hall, 1995. ISBN 0-13-096751-3.
[0119] [4] U.S. Pat. No. 5,065,395, "Rudimentary Digital Speech Interpolation Apparatus and Method". Issued Nov. 12, 1991.
[0120] [5] U.S. Pat. No. 5,151,901, "DS1 trunk packing and unpacking apparatus and method". Issued Sep. 29, 1992.
[0121] [6] U.S. Pat. No. 5,280,532, "N:1 bit compression apparatus and method". Issued Jan. 18, 1994.
[0122] [7] U.S. Pat. No. 5,327,495, "Apparatus and Method for Controlling an Echo Canceller". Issued Jul. 5, 1994.
[0123] [8] U.S. Pat. No. 5,007,086, "Apparatus and method for generating low level noise signals". Issued Apr. 9, 1991.