
The Representation of Speech

Historically, the primary use of encryption has been, of course, to protect messages in text form. Advancing technology has allowed images and audio to be stored and communicated in digital form. A particularly effective method of compressing images is the Discrete Cosine Transform, which is used in the JPEG (Joint Photographic Experts Group) file format.

When sound is converted to an analogue electrical signal by an appropriate transducer (a device for converting changing levels of one quantity to changing levels of another) such as a microphone, the resulting electrical signal has a value that changes over time, oscillating between positive and negative.

A Compact Disc stores stereo musical recordings in the form of two digital audio channels, each one containing 44,100 16-bit signed integers for every second of sound. This leads to a total data rate of 176,400 bytes per second.
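The arithmetic behind that figure can be checked with a short sketch. The function name and parameter names here are my own, chosen only for illustration:

```python
SAMPLE_RATE_HZ = 44_100   # samples per second, per channel
BITS_PER_SAMPLE = 16
CHANNELS = 2              # stereo


def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels):
    """Data rate of uncompressed sampled audio, in bytes per second."""
    return sample_rate_hz * bits_per_sample * channels // 8


cd_rate = pcm_bytes_per_second(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
print(cd_rate)  # 176400
```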

For transmitting a telephone conversation digitally, the same level of fidelity is not required. Only a single audio channel is used, and only frequencies of up to 3000 cycles per second (or 3000 hertz) need to be reproduced. Because of a mathematical law called the Nyquist theorem, this requires 6000 samples of the level of the audio signal to be taken each second. The signal must first be bandlimited to the range of frequencies to be reproduced; otherwise aliasing may result.
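As a rough illustration of why the bandlimiting matters, the sketch below computes the sampling rate for a given top frequency and the frequency at which a pure tone would appear after sampling. Both function names are hypothetical, and the folding rule assumes ideal sampling of a single sine wave:

```python
def nyquist_rate(max_frequency_hz):
    """Samples per second needed for frequencies up to max_frequency_hz."""
    return 2 * max_frequency_hz


def alias_frequency(tone_hz, sample_rate_hz):
    """Apparent frequency of a pure tone after sampling.

    Tones above half the sampling rate fold back into the range
    0 .. sample_rate_hz / 2, which is what aliasing means.
    """
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


print(nyquist_rate(3000))           # 6000
print(alias_frequency(2500, 6000))  # 2500: inside the band, unchanged
print(alias_frequency(4000, 6000))  # 2000: outside the band, aliased
```

A 4000 Hz tone that was not filtered out beforehand would be indistinguishable, in the samples, from a 2000 Hz tone.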

For many communications applications, samples of audio waveforms are one byte in length, and they are represented by a type of floating-point notation to allow one byte to represent an adequate range of levels.

Simple floating-point notation, for an eight-bit byte, might look like this:

S EE MMMMM
0 11 11111  1111.1  
0 11 10000  1000.0
0 10 11111   111.11
0 10 10000   100.00
0 01 11111    11.111
0 01 10000    10.000
0 00 11111     1.1111
0 00 10000     1.0000

The sign bit is always shown as 0, which indicates a positive number. Negative numbers are often indicated in floating-point notation by making the sign bit a 1 without changing any other part of the number, although other conventions are used as well. For comparison purposes, the floating-point notations shown have all been scaled so that 1 represents the smallest nonzero number that can be indicated.
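A minimal sketch of how a byte in this simple format could be decoded follows. It assumes the scaling used in the table (the byte 0 00 10000 stands for 1) and the sign-bit convention just described; the function name is my own:

```python
def decode_simple(byte):
    """Decode an S EE MMMMM byte, scaled so that 0 00 10000 is 1.0."""
    sign = -1 if byte & 0x80 else 1
    exponent = (byte >> 5) & 0b11
    mantissa = byte & 0b11111   # first bit is expected to be 1
    return sign * (mantissa / 16) * 2 ** exponent


print(decode_simple(0b0_11_11111))  # 15.5, binary 1111.1
print(decode_simple(0b0_00_10000))  # 1.0
print(decode_simple(0b1_10_10000))  # -4.0
```

The sketch does not reject mantissas whose first bit is zero, although the format as shown never uses them.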

One way the range of values that can be represented can be extended is by allowing gradual underflow, where an unnormalized mantissa is permitted for the smallest exponent value.

S EE MMMMM
0 11 11111  11111000  
0 11 10000  10000000
0 10 11111   1111100
0 10 10000   1000000
0 01 11111    111110
0 01 10000    100000
0 00 11111     11111
0 00 10000     10000
0 00 01111      1111
0 00 01000      1000
0 00 00111       111
0 00 00100       100
0 00 00011        11
0 00 00010        10
0 00 00001         1
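The table above can be reproduced with a sketch like the following, scaled so that 0 00 00001 stands for 1. Raising an error for an unnormalized mantissa with a nonzero exponent is my own assumption about how such bytes should be treated:

```python
def decode_gradual(byte):
    """Decode an S EE MMMMM byte with gradual underflow, as an integer."""
    sign = -1 if byte & 0x80 else 1
    exponent = (byte >> 5) & 0b11
    mantissa = byte & 0b11111
    if exponent > 0 and mantissa < 0b10000:
        # Assumption: unnormalized mantissas are only legal for exponent 00.
        raise ValueError("unnormalized mantissa with nonzero exponent")
    return sign * mantissa * 2 ** exponent


print(decode_gradual(0b0_11_11111))  # 248, binary 11111000
print(decode_gradual(0b0_00_10000))  # 16
print(decode_gradual(0b0_00_00001))  # 1
```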

Another way of making a floating-point representation more efficient involves noting that, in the first case, the first mantissa bit is always one. (The field of a floating-point number that represents the actual number directly is called the mantissa because it would correspond to the fractional part of the number's logarithm to the base used for the exponent.) With gradual underflow, that bit is only allowed to be zero for one exponent value. Instead of using gradual underflow, one could use the basic floating-point representation we started with, but simply omit the bit that is always equal to one.

This could produce a result like this:

S EEE MMMM
0 111 aaaa  1aaaa000
0 110 aaaa   1aaaa00
0 101 aaaa    1aaaa0
0 100 aaaa     1aaaa
0 011 aaaa      1aaa.a
0 010 aaaa       1aa.aa
0 001 aaaa        1a.aaa
0 000 aaaa         1.aaaa

Here, the variable bits of the mantissa are denoted by aaaa, instead of being represented as all ones in one line, and all zeroes in a following line, for both compactness and clarity.
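A sketch of decoding the format with the omitted leading bit, using the same scaling as the table (0 000 0000 stands for 1), might look like this. The function name is hypothetical:

```python
def decode_hidden_bit(byte):
    """Decode an S EEE MMMM byte whose leading 1 bit is not stored."""
    sign = -1 if byte & 0x80 else 1
    exponent = (byte >> 4) & 0b111
    fraction = byte & 0b1111
    # Restore the omitted leading 1 before scaling by the exponent.
    return sign * (16 + fraction) / 16 * 2 ** exponent


print(decode_hidden_bit(0b0_111_1111))  # 248.0, binary 11111000
print(decode_hidden_bit(0b0_100_0000))  # 16.0
print(decode_hidden_bit(0b0_000_0000))  # 1.0
```

Note that in this sketch even the all-zero byte decodes to 1, so the format as it stands has no way to represent zero.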

Today's personal computers use a standard floating-point format that combines gradual underflow with suppressing the first one bit in the mantissa. This is achieved by reserving a special exponent value, the lowest one, to behave differently from the others. That exponent value is required to multiply the mantissa by the same amount as the next higher exponent value (instead of a power of the radix that is one less), and the mantissa, for that exponent value, does not have its first one bit suppressed.
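The rule for the special lowest exponent can be sketched on an eight-bit toy format. This is not the actual format used by personal computers, only an S EEE MMMM illustration of the same idea, scaled so that the smallest nonzero value is 1; the real standard also reserves the highest exponent for special values, which this sketch ignores:

```python
def decode_combined(byte):
    """Toy S EEE MMMM format: hidden bit plus gradual underflow."""
    sign = -1 if byte & 0x80 else 1
    exponent = (byte >> 4) & 0b111
    fraction = byte & 0b1111
    if exponent == 0:
        # Lowest exponent: no hidden bit, same scale as exponent 1.
        return sign * fraction
    return sign * (16 + fraction) * 2 ** (exponent - 1)


print(decode_combined(0b0_000_0001))  # 1, the smallest nonzero value
print(decode_combined(0b0_000_1111))  # 15, largest value without hidden bit
print(decode_combined(0b0_001_0000))  # 16, smallest value with hidden bit
print(decode_combined(0b0_111_1111))  # 1984
```

The values run on without a gap or overlap from the lowest exponent to the next one, and the all-zero byte now represents zero.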

Another method of representing floating point quantities efficiently is something I call extremely gradual underflow. This retains the first one bit in the mantissa, but treats the degree of unnormalization of the mantissa as the most significant part of the exponent field. It works like this (the third column shows an alternate version of this format, to be explained below):

S EE MMMMM                         S M EE MMMM
0 11 1aaaa  1aaaa000000000000000   0 1 11 aaaa
0 10 1aaaa   1aaaa00000000000000   0 1 10 aaaa
0 01 1aaaa    1aaaa0000000000000   0 1 01 aaaa
0 00 1aaaa     1aaaa000000000000   0 1 00 aaaa

                                   S MM EE MMM
0 11 01aaa      1aaa000000000000   0 01 11 aaa
0 10 01aaa       1aaa00000000000   0 01 10 aaa
0 01 01aaa        1aaa0000000000   0 01 01 aaa
0 00 01aaa         1aaa000000000   0 01 00 aaa

                                   S MMM EE MM
0 11 001aa          1aa000000000   0 001 11 aa
0 10 001aa           1aa00000000   0 001 10 aa
0 01 001aa            1aa0000000   0 001 01 aa
0 00 001aa             1aa000000   0 001 00 aa

                                   S MMMM EE M
0 11 0001a              1a000000   0 0001 11 a
0 10 0001a               1a00000   0 0001 10 a
0 01 0001a                1a0000   0 0001 01 a
0 00 0001a                 1a000   0 0001 00 a

                                   S MMMMM EE
0 11 00001                  1000   0 00001 11
0 10 00001                   100   0 00001 10
0 01 00001                    10   0 00001 01
0 00 00001                     1   0 00001 00
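The first two columns of the table can be reproduced by the sketch below. Each leading zero in the mantissa lowers the value by four binary places, one more than the mantissa itself loses, so the shift is three places per remaining mantissa bit plus the exponent. Treating an all-zero mantissa as zero is my own assumption, since the table does not cover that case:

```python
def decode_extremely_gradual(byte):
    """Decode an S EE MMMMM byte with extremely gradual underflow."""
    sign = -1 if byte & 0x80 else 1
    exponent = (byte >> 5) & 0b11
    mantissa = byte & 0b11111
    if mantissa == 0:
        return 0  # assumption: not defined by the table
    significant_bits = mantissa.bit_length()  # 5 minus the leading zeros
    shift = 3 * (significant_bits - 1) + exponent
    return sign * (mantissa << shift)


print(decode_extremely_gradual(0b0_00_00001))  # 1
print(decode_extremely_gradual(0b0_11_00001))  # 8, binary 1000
print(decode_extremely_gradual(0b0_00_00010))  # 16, binary 10000
print(decode_extremely_gradual(0b0_11_11111))  # 1015808, 11111 and 15 zeros
```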

Although usually a negative number is indicated simply by setting the sign bit to 1, another possibility is to also invert all the other bits in the number. In this way, for some of the simpler floating-point formats, an integer comparison instruction can also be used to test if one floating-point number is larger than another.

This definitely will not work for the complicated extremely gradual underflow format as it is shown here. However, that format can be coded so as to allow this to work, as follows: the exponent field can be made movable, and it can be placed after the first 1 bit in the mantissa field. This is the format shown in the third column above.

When this is done, for very small numbers the idea of allowing the exponent field to shrink suggests itself.

Thus, if the table above is continued, we obtain:

S EE MMMMM                              S MMMMM EE
0 11 00001                  1000        0 00001 11
0 10 00001                   100        0 00001 10
0 01 00001                    10        0 00001 01
0 00 00001                     1        0 00001 00

                                        S MMMMMM E
N/A                            0.1      0 000001 1
N/A                            0.01     0 000001 0

                                        S MMMMMMM
N/A                            0.001    0 0000001

Something very similar is used to represent sound signals in 8-bit form using the A-law, which is the standard for European microwave telephone transmission, and which is also sometimes used for satellite audio transmissions. However, the convention for representing the sign of numbers is different.
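The movable-exponent version of the format, including the shrinking exponent field for very small numbers, can be sketched as follows. This decodes only the seven bits after the sign, and it follows the tables above rather than the A-law itself; treating an all-zero field as zero is my own assumption:

```python
def decode_movable(field):
    """Decode the 7 bits after the sign in the movable-exponent format."""
    if field == 0:
        return 0.0  # assumption: not defined by the tables
    rest_bits = field.bit_length() - 1        # bits after the leading 1
    rest = field & ((1 << rest_bits) - 1)
    if rest_bits >= 2:
        # Two exponent bits follow the leading 1, then the mantissa bits.
        frac_bits = rest_bits - 2
        exponent = rest >> frac_bits
        mantissa = (1 << frac_bits) | (rest & ((1 << frac_bits) - 1))
        return mantissa * 2.0 ** (3 * frac_bits + exponent)
    if rest_bits == 1:
        # Only a one-bit exponent is left: binary 0.01 or 0.1.
        return 2.0 ** (rest - 2)
    return 0.125                              # 0000001: binary 0.001


values = [decode_movable(f) for f in range(1, 128)]
print(values[0], values[-1])       # 0.125 1015808.0
print(values == sorted(values))    # True
```

Because the decoded values increase with the field taken as an unsigned integer, an integer comparison gives the right answer for positive numbers in this format.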

Also, if this method, with a two-bit exponent, were used for encoding audio signals with 16 bits per sample, the result, for the loudest signals, would have the same precision as a 14-bit signed integer: a sign bit plus 13 bits of mantissa. (Many early digital audio systems used 14 bits per sample rather than 16 bits.) But the dynamic range, the difference between the softest and loudest signals possible, would be that of a 56-bit integer.
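One way to arrive at these two figures is to count mantissa bits and binary places, as in the sketch below. The counting rule is my own reading of the tables above (every position of the leading 1 that leaves room for the full exponent contributes four binary places, and the shrinking exponent adds three more), so it should be taken as an assumption rather than a definition:

```python
def extremely_gradual_widths(total_bits, exponent_bits=2):
    """Return (precision, dynamic range) as equivalent signed-integer widths."""
    field_bits = total_bits - 1                  # everything but the sign
    mantissa_bits = field_bits - exponent_bits   # leading 1 plus the rest
    # Each position of the leading 1 that still leaves room for the full
    # exponent field covers 2**exponent_bits binary places.
    places = mantissa_bits * 2 ** exponent_bits
    # The shrinking exponent field adds 2**k places for k remaining bits.
    places += sum(2 ** k for k in range(exponent_bits))
    return mantissa_bits + 1, places + 1         # add one for the sign


print(extremely_gradual_widths(16))  # (14, 56)
print(extremely_gradual_widths(8))   # (6, 24)
```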

One problem with using floating-point representations of signals for digital high-fidelity audio - although this particular format seems precise enough to largely make that problem minor - is that the human ear can still hear relatively faint sounds while another sound is present, if the two sounds are in different parts of the frequency spectrum. This is why some methods of music compression, such as those used with Sony's MiniDisc format, Philips' DCC (Digital Compact Cassette), and today's popular MP3 audio format, work by dividing the audio spectrum up into "critical bands", which are to some extent processed separately.

Transmitting 6000 bytes per second is an improvement over 176,400 bytes per second, but it is still a fairly high data rate, requiring a transmission rate of 48,000 bits per second.

Other techniques of compressing audio waveforms include delta modulation, where the difference between consecutive samples, rather than the samples themselves, is transmitted. A technique called ADPCM, adaptive differential pulse code modulation, works by such methods as extrapolating the previous two samples in a straight line, and assigning the available codes for levels for the current sample symmetrically around the extrapolated point.
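The two ideas in the preceding paragraph can be sketched in a few lines. This is only an illustration with hypothetical function names: the differences here are sent exactly, whereas a practical coder would quantize them to a few bits and adapt the step size:

```python
def delta_encode(samples):
    """Replace each sample by its difference from the previous one."""
    previous = 0
    deltas = []
    for sample in samples:
        deltas.append(sample - previous)
        previous = sample
    return deltas


def delta_decode(deltas):
    """Rebuild the samples by accumulating the differences."""
    total = 0
    samples = []
    for delta in deltas:
        total += delta
        samples.append(total)
    return samples


def extrapolate(two_back, one_back):
    """Predict the next sample by a straight line through the last two."""
    return 2 * one_back - two_back


samples = [0, 3, 7, 12, 18]
deltas = delta_encode(samples)
print(deltas)                # [0, 3, 4, 5, 6]
print(delta_decode(deltas))  # [0, 3, 7, 12, 18]
print(extrapolate(7, 12))    # 17, against an actual next sample of 18
```

The differences, and even more so the errors of the straight-line prediction, are smaller than the samples themselves, which is what allows fewer bits to be used for each one.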

The term LPC, which means linear predictive coding, does not, as it might seem, refer to this kind of technique, but instead to a method that can very effectively reduce the amount of data required to transmit a speech signal, because it is based on the way the human vocal tract forms speech sounds.

There is a good page about Linear Predictive Coding at this site.

In the latter part of World War II, the United States developed a highly secure speech scrambling system which used the vocoder principle to convert speech to a digital format. This format was then enciphered by means of a one-time-pad, and the result was transmitted using the spread-spectrum technique.

The one-time-pad was in the form of a phonograph record, containing a signal which had six distinct levels. The records used by the two stations communicating were kept synchronized by the use of quartz crystal oscillators where the quartz crystals were kept at a controlled temperature. The system was called SIGSALY, and an article by David Kahn in the September, 1984 issue of Spectrum described it.

Speech was converted for transmission as follows:

The loudness of the portion of the sound in each of ten frequency bands, on average 280 Hz in width (ranging from 150 Hz to 2950 Hz), was determined for periods of one fiftieth of a second. This loudness was represented by one of six levels.

The fundamental frequency of the speaking voice was represented by 35 codes; a 36th code indicated that a white noise source should be used instead in reconstructing the voice. This was also sampled fifty times a second.

The intensities of sound in the bands indicated both the loudness of the fundamental signal, and the resonance of the vocal tract with respect to those harmonics of the fundamental signal that fell within the band. Either a waveform with the frequency of the fundamental, and a full set of harmonics, or white noise, was used as the source of the reconstructed sound in the receiver, and it was then filtered in the ten bands to match the observed intensities in these bands.

This involved the transmission of twelve base-6 digits, 50 times a second.
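The count of twelve digits can be illustrated by a sketch that packs one fiftieth of a second of speech into base-6 digits: one digit for each of the ten bands, and two digits for the 36 possible pitch codes. The function name, the digit order, and the numbering of the codes from 0 are my own assumptions, not details of the actual system:

```python
def frame_to_digits(band_levels, pitch_code):
    """Pack ten band levels (0-5) and a pitch code (0-35) into 12 digits."""
    if len(band_levels) != 10:
        raise ValueError("expected ten band levels")
    if any(not 0 <= level < 6 for level in band_levels):
        raise ValueError("band levels must be in the range 0 to 5")
    if not 0 <= pitch_code < 36:
        raise ValueError("pitch code must be in the range 0 to 35")
    high, low = divmod(pitch_code, 6)   # two base-6 digits for the pitch
    return list(band_levels) + [high, low]


digits = frame_to_digits([0, 1, 2, 3, 4, 5, 4, 3, 2, 1], 35)
print(len(digits))  # 12
print(digits[-2:])  # [5, 5]
```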

Since 6 to the 12th power is 2,176,782,336, which is just over 2^31, which is 2,147,483,648, each group of twelve digits carries about 31 bits, or roughly four bytes. At 50 groups a second, this roughly corresponds to transmitting 200 bytes, or 1,600 bits, a second. This uses only two-thirds of the capacity of a 2,400 bit-per-second modem, and is quite a moderate data rate.
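The figures in this estimate can be checked directly. The variable names below are my own:

```python
import math

DIGITS_PER_FRAME = 12
FRAMES_PER_SECOND = 50

combinations = 6 ** DIGITS_PER_FRAME
bits_per_frame = math.log2(combinations)        # a little over 31
bytes_per_frame = math.ceil(bits_per_frame / 8)  # rounded up to whole bytes
bytes_per_second = bytes_per_frame * FRAMES_PER_SECOND

print(combinations)               # 2176782336
print(2 ** 31)                    # 2147483648
print(bytes_per_second)           # 200
print(bytes_per_second * 8 / 2400)
```

The last line prints the fraction of a 2,400 bit-per-second channel that 200 bytes a second would occupy, which comes to two-thirds.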

The sound quality this provided, however, was mediocre. A standard for linear predictive coding, known as CELP, comes in two versions which convert the human voice to a 2,400 bit-per-second signal or to a 4,800 bit-per-second signal.
