# VLSI Implementation of the Receiver for 8-PSK Trellis Coded Modulation with Phase Ambiguity Resolution<sup>\*</sup>

S. Benedetto<sup>a</sup>, V. Magnani<sup>b</sup>, M. Mondin<sup>a</sup>, F. Pasello<sup>b</sup>

<sup>a</sup>Dipartimento di Elettronica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy

<sup>b</sup>Satellite Communication Systems, Siemens Telecomunicazioni SpA, S.S. Padana Superiore km 158, 20060 Cassina de' Pecchi (MI), Italy

# Abstract

The design and the implementation of a receiver for 8-PSK trellis-coded modulation based on the Viterbi algorithm with phase ambiguity resolution is described, in the framework of an application related to satellite transmission of digital HDTV. The optimum 8-state code with asymptotic gain of 3.6 dB with phase invariance of 360° has been considered. Several solutions for the phase ambiguity resolution in a COPSK demodulator have been investigated and a solution based on the Viterbi decoder branch metric observation has been chosen for the implementation. The Viterbi soft decision decoder ASIC has been realised in 1.0  $\mu$ m drawn gate length HCMOS technology. A complete 44.736 Mbit/s modem has then been realised and tested.

# 1. INTRODUCTION

The problem of receiving high-capacity satellite downlink carriers, like the ones envisaged for digital HDTV, with relatively small aperture antennae can be successfully handled through the adoption of trellis-coded modulation (TCM) schemes. Among the possible signal constellations, coherent octal PSK (COPSK) is highly attractive because of its robustness with respect to channel nonlinearities and its bandwidth compatibility with QPSK. In [1] several code alternatives were examined, allowing a coding gain of the order of 3 dB with a reasonable decoder complexity. TCM schemes employing 8-PSK two-dimensional constellations are not rotationally invariant when optimized in terms of coding gain; this means that code sequences rotated by multiples of  $\pi/4$  do not belong to the code, so that differential decoding, as in conventional memoryless PSK receivers, cannot be applied. A way of solving the consequent phase ambiguity consists in sending a known training sequence of data at the beginning of the transmission, so as to identify the correct phase. However, when a carrier synchronization loss occurs during transmission, the subsequent decisions become completely unreliable.

<sup>\*</sup>This work was sponsored by ESA/Estec under contract No. 7763/88/NL/JG(SC)

Two possible solutions to the phase ambiguity problem have been proposed in [2] and [3]. In [2], the Viterbi decoder is provided with some hardware, which decides upon the correctness of the acquired phase on the basis of the values of accumulated distances between the received signal and the most likely signal emerging from the state possessing the best metric in the Viterbi algorithm. Conversely, in [3] the authors propose to differentially encode the signals after the encoder, and then use a suboptimum decoder instead of the maximum likelihood one.

In this paper, we describe the design, implementation and testing of a high-speed, 8-state TCM COPSK receiver, which incorporates a VLSI chip performing the maximum likelihood decoding and the phase ambiguity resolution according to an algorithm working on the correlation metrics of the Viterbi algorithm.

The TCM code which has been implemented is the best 8-state OPSK scheme (in terms of free Euclidean distance and number of nearest neighbours) available in the literature. Its characterizing parameters, i.e. the octal representation of the code vectors describing the rate 2/3 convolutional code, the free Euclidean distance, the number of nearest neighbours, the asymptotic gain with respect to uncoded QPSK and the rotational invariance, are reported in Table 1. The design criteria leading to the choice have been described in [1].

| Nstates | m | $h^2$ | $h^1$ | $h^0$ | $\tilde{k}$ | $d^2_{free}$ | $N_{free}$ | $\gamma$ | Invariance |
|---------|---|-------|-------|-------|-------------|--------------|------------|----------|------------|
| 8       | 3 | 4     | 2     | 11    | 2           | 4.586        | 2          | 3.6 dB   | 360°       |

Table 1: Code Characteristics
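The asymptotic gain in Table 1 can be checked from the free distance alone. The short sketch below assumes unit-energy constellations, so that uncoded QPSK has a squared minimum distance of 2; the variable names are illustrative.

```python
import math

# Squared free Euclidean distance of the 8-state code (Table 1),
# referred to a unit-energy 8-PSK constellation.
d2_free = 4.586

# Assumption: the reference is unit-energy uncoded QPSK,
# whose squared minimum distance is 2.
d2_qpsk = 2.0

# Asymptotic coding gain in dB
gain_db = 10.0 * math.log10(d2_free / d2_qpsk)
print(f"asymptotic gain = {gain_db:.2f} dB")
```

The result is about 3.6 dB, in agreement with the value reported in the table.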

# 2. PHASE AMBIGUITY RESOLUTION BY BRANCH METRIC OBSERVATION

We have investigated in [5] several possible alternatives for the phase ambiguity resolution, consisting in variations of the approach described in [2]. The final choice, implemented in the VLSI chip and described hereafter, offers the best trade-off between performance and implementation complexity.

### 2.1. Description of the Phase Ambiguity Resolution Algorithm

The Phase Ambiguity Resolution Algorithm (PARA)<sup>\*</sup> consists of a *misalignment* detector, which derives information on whether the proper carrier phase has been recovered or not. This information is passed to a phase shifter, which, in turn, in case of a declared misalignment, sequentially counter-rotates the demodulated constellation by multiples of  $\pi/4$  until the correct phase has been achieved. The misalignment detector exploits the fact that, in the presence of a correct phase recovery, there is a high correlation between the received signal and the highest branch-metric signal emerging from the best path-metric state in the code trellis. On the contrary, this correlation is reduced when the phase recovery is incorrect.

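In formulae, let  $m_k$  denote the highest branch metric emerging from the best path-metric state  $S_k$  during the k-th symbol interval. With the correlation metrics used by the decoder, this is

$$m_k = \max_{i} \left( X \cos\varphi_i + Y \sin\varphi_i \right) \qquad (1)$$

where the maximum is taken over the signals labelling the branches that leave  $S_k$ ,  $i \in \{0,...,7\}$  and  $\varphi_i = i\pi/4$ , while (X, Y) are the coordinates of the received signal point. Then, the algorithm evaluates the decision variable *C* by adding *N* consecutive branch metrics as in: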

$$C = \sum_{k=L}^{N+L-1} m_k$$

and compares *C* with a threshold *T*, declaring a correct phase compensation (hypothesis  $H_0$ ) or an incorrect compensation (hypothesis  $H_1$ ) according to whether *C* is greater or smaller than *T*. This misalignment detector is combined with a phase shifter, which sequentially counter-rotates the demodulated constellation by multiples of  $\pi/4$  until a correct phase is detected. Since the branch metrics (1) are already available in the Viterbi decoder, the algorithm requires a minimum amount of extra hardware and can be easily incorporated into the ASIC.
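The decision rule can be sketched in a few lines of Python. This is only an illustration of the windowed sum and threshold test described above: the function names are hypothetical, the test windows use constant metric values rather than real decoder output, and the assumption that the shifter advances by one position of  $\pi/4$  at each declared misalignment is ours.

```python
def decision_variable(best_metrics):
    """C: sum of the N consecutive best branch metrics m_k of one window."""
    return sum(best_metrics)

def phase_is_correct(best_metrics, threshold):
    """True for hypothesis H0 (C greater than T), False for H1."""
    return decision_variable(best_metrics) > threshold

def next_rotation(rotation_index, aligned):
    """Index r of the counter-rotation r*pi/4 applied to the constellation.
    Assumption: on a declared misalignment the shifter steps by one position."""
    return rotation_index if aligned else (rotation_index + 1) % 8

# Illustrative windows only: constant metrics near the simulated means of C/N
N, T = 4096, 79378            # window length and threshold for Eb/N0 = 5 dB (Table 4)
aligned_window = [20.3] * N
misaligned_window = [19.1] * N

rotation = 0
rotation = next_rotation(rotation, phase_is_correct(misaligned_window, T))  # steps to 1
rotation = next_rotation(rotation, phase_is_correct(aligned_window, T))     # stays at 1
```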

### 2.2. Performance Parameters of the Algorithm

The performance of the algorithm can be assessed by evaluating the "*false alarm*" and "*miss*" probabilities,  $P_{fa}$  and  $P_m$  [4]. The false alarm event occurs when C does not reach the threshold T in the  $H_0$  condition, whereas the miss event refers to the situation in which C is larger than the threshold in the  $H_1$  condition. The two events of wrong detection play different roles in the behaviour of the detection algorithm. The *false alarm* event is the most undesirable, since it causes the loss of a correct phase: a very low value of  $P_{fa}$  is then required ( $P_{fa} \approx 10^{-30}$ ). The probability of *miss* can assume higher values; however, it should be lower than the bit error probability, in order not to degrade the decoder performance.

<sup>\*</sup>Patent pending.

A semianalytical method, based on the hypothesis of Gaussian distribution for *C*, has been used to compute the two probabilities of *miss* and *false alarm*. The mean value and the variance of *C* have been determined by simulation and the threshold value *T* necessary to obtain the desired probability of *miss* has been evaluated. The Gaussian assumption for *C* has been verified by comparing analytical and simulation results for  $P_m$  [5]. Given the previous hypothesis, it has been verified that a window length of the order of 4000 is required to obtain a value of  $P_{fa}$  of the order of  $10^{-30}$ , as desired. In practice, the value N = 4096 has been chosen for convenience.
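The Gaussian approximation can be illustrated numerically. The sketch below takes the mean and variance of the normalized estimator C' = C/N from the software row of Table 3 and the normalized threshold for 5 dB from Table 4, and evaluates the two Gaussian tail probabilities; the helper name `q_function` is ours, and the result should be read as an order-of-magnitude check rather than as the authors' exact procedure.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# Normalized estimator C' = C/N, software simulation, N = 4096, Eb/N0 = 5 dB (Table 3)
mean_h0, var_h0 = 20.3288, 0.00705
mean_h1, var_h1 = 19.0675, 0.01144

# Normalized threshold T' for Eb/N0 = 5 dB (Table 4)
t_norm = 19.3793

# False alarm: C' falls below T' although the phase is correct (H0)
p_fa = q_function((mean_h0 - t_norm) / math.sqrt(var_h0))

# Miss: C' exceeds T' although the phase is wrong (H1)
p_m = q_function((t_norm - mean_h1) / math.sqrt(var_h1))

print(f"P_fa ~ {p_fa:.1e}, P_m ~ {p_m:.1e}")
```

With these inputs the false alarm probability comes out of the order of  $10^{-30}$  and the miss probability of the order of  $10^{-3}$ , consistent with the design targets.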

### 2.3. Quantization Effects

Once a phase error has been detected, it is resolved by properly rotating the quantized demodulated components  $(X_q, Y_q)$  in order to obtain the new coordinates  $(X_q', Y_q')$ . The coordinates  $(X_q, Y_q)$  are obtained by passing the ideal demodulated coordinates (X, Y) through a uniform quantizer with saturation threshold  $Amp_s$ . Since  $(X_q', Y_q')$  are represented with the same number of bits as  $(X_q, Y_q)$ , a rotation will induce a saturation effect on them. If, for instance, a counterclockwise rotation of  $\pi/4$  radians is required, the effect is the one visualized in Fig. 1, where it can be seen that all the points lying in the upper dark triangle will be hard clipped.



Fig. 1 Joint effect of the constellation rotation and the components saturation when  $Amp_s = 0.9$ .

The same phenomenon occurs for all rotations equal to odd multiples of  $\pi/4$ . This saturation effect depends on the value  $Amp_s$  of the quantizer threshold. Decreasing its value enhances the saturation effect, whereas increasing it leads to a regular behaviour of the rotation operation, but, on the other hand, entails a poor exploitation of the quantizer range and, as a consequence, degrades the decoder performance.
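The clipping mechanism can be reproduced with a small sketch. It models only the saturation (not the finite quantization step), and the function names are illustrative. A point at the corner of the quantizer square is clipped after a rotation of  $\pi/4$  but not after a rotation of  $\pi/2$ .

```python
import math

def saturate(v, amp_s):
    """Clip one component to the quantizer range [-amp_s, amp_s]."""
    return max(-amp_s, min(amp_s, v))

def rotate_and_saturate(x, y, steps, amp_s):
    """Rotate (x, y) counterclockwise by steps*pi/4, then clip each component."""
    theta = steps * math.pi / 4
    xr = x * math.cos(theta) - y * math.sin(theta)
    yr = x * math.sin(theta) + y * math.cos(theta)
    return saturate(xr, amp_s), saturate(yr, amp_s)

amp_s = 0.9                      # saturation threshold used in Fig. 1
corner = (amp_s, amp_s)          # point at the corner of the quantizer square

# Odd multiple of pi/4: the rotated y component (about 1.27) is hard clipped
x1, y1 = rotate_and_saturate(*corner, 1, amp_s)

# Even multiple of pi/4: the square maps onto itself, so nothing is clipped
x2, y2 = rotate_and_saturate(*corner, 2, amp_s)
```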

To optimize the choice of  $Amp_s$ , we have used the curves of Fig. 2 and 3, obtained by simulation and (for  $P_{fa}$ ) semi-analytically.



From Fig. 2 we observe that the value of  $Amp_s$  that minimizes the bit error probability is in the range  $0.5 \div 0.7$ , while from Fig. 3 we observe that the value of  $Amp_s$  that minimizes the *false alarm* probability is in the range  $0.8 \div 1.0$ . For this reason, the value of  $Amp_s = 0.78$  has been chosen, which corresponds to the knee of the *false alarm* probability curve of Fig. 3.

### 2.3. Simulations with the Design Parameters

The performance of PARA has been evaluated with a signal-to-noise ratio Eb/No = 5 dB, which corresponds to a BER of roughly  $10^{-3}$  for the considered COPSK TCM scheme (see [6]). As seen before, a window of length N = 4096 has been used, such that  $P_{fa} \approx 10^{-30}$ , and the value of the threshold *T* has been chosen in order to obtain  $P_m = 10^{-3}$ . The complete system used in the simulations includes the automatic gain control (AGC), the transmit and receive filters and the carrier recovery system. The obtained bit error probability, together with the mean value and variance of the normalized estimator C' = C/N, is reported in Table 2, where the first row refers to an ideal carrier recovery circuit and the second row to the Costas loop actually implemented in the demodulator. The values of the bit error probability are almost coincident with the ideal ones.

| Carrier recovery | Mean Value | Variance | $P_b(e)$ (BER)     |
|------------------|------------|----------|--------------------|
| Ideal            | 20.3544    | 0.00705  | $1.63\cdot10^{-3}$ |
| Costas loop      | 20.3288    | 0.00705  | $1.81\cdot10^{-3}$ |

Table 2: Bit error probability  $P_b(e)$  and mean value and variance of the estimator C' measured by simulation, with a rotation  $\theta = \pi/4$  introduced by the Viterbi decoder. The simulation has been performed in the  $H_0$  condition, using 204800 information symbols.

Before releasing the ASIC design we have also verified that the "hardware" description of the decoder worked properly. In particular, we compared the mean value and variance of the estimator C in the two working conditions as obtained by system simulation and through the hardware development tools. The results are reported in Table 3.

| Description | $H_0$ condition ($\theta=\pi/4$): Mean Value | $H_0$ condition ($\theta=\pi/4$): Variance | $H_1$ condition ($\theta=\pi/4$): Mean Value | $H_1$ condition ($\theta=\pi/4$): Variance |
|-------------|---------|---------|---------|---------|
| Software    | 20.3288 | 0.00705 | 19.0675 | 0.01144 |
| Hardware\*  | 20.4016 | 0.00744 | 19.0870 | 0.0089  |

Table 3: Mean value and variance of the estimator C' in the  $H_0$  and  $H_1$  conditions measured by simulating 50 windows (204800 symbols) with the TOPSIM-IV package and using different rotations  $\theta$  introduced by the Viterbi decoder, for N = 4096,  $Amp_s = 0.78$ ,  $E_b/N_0 = 5$  dB. Note (\*):  $\theta = 5\pi/4$  in the  $H_0$  condition.

#### 2.3.1. Implementation Choices

On the basis of the simulation results, the values of the main parameters controlling the algorithm behaviour have been determined. They are given in Table 4, together with the evaluated values of the *miss* and *false alarm* probabilities. Two sets of parameters have been programmed in the chip, optimized for signal-to-noise ratios of 4 and 5 dB. A control pin is used to choose between the two settings.

| $E_b/N_0$ [dB] | Receiver Status | N    | T'      | T      | $P_{fa}$   | $P_m$              |
|----------------|-----------------|------|---------|--------|------------|--------------------|
| 5              | $H_1$           | 4096 | 19.3793 | 79378  | $10^{-31}$ | $10^{-3}$          |
| 5              | $H_0$           | 8192 | 19.3793 | 158755 | $10^{-62}$ | $6\cdot10^{-6}$    |
| 4              | $H_1$           | 4096 | 19.1270 | 78344  | $10^{-14}$ | $1.4\cdot10^{-2}$  |
| 4              | $H_0$           | 8192 | 19.1270 | 156688 | $10^{-27}$ | $10^{-3}$          |

Table 4: Values of the window length and threshold of the phase ambiguity resolution algorithm, optimized for signal-to-noise ratios of 4 and 5 dB. T' = T/N is the normalized threshold.

As shown in Table 4, the parameter settings depend on the receiver status. When the receiver is in the  $H_0$  status, i.e. a correct phase situation, the window length is doubled with respect to the opposite  $H_1$  status. This is done in order to further decrease the false alarm probability. On the other hand, in the  $H_1$  status, the aim is to limit the time required to achieve synchronization, and thus a shorter window is used.
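The status-dependent settings can be summarised as a lookup, sketched below with the values of Table 4. The dictionary layout and function names are illustrative and do not describe how the parameters are stored in the chip; the absolute threshold is recomputed as T = T'·N, rounded to the nearest integer.

```python
# (Eb/N0 in dB, receiver status) -> (window length N, normalized threshold T'),
# values taken from Table 4
PARA_SETTINGS = {
    (5, "H1"): (4096, 19.3793),
    (5, "H0"): (8192, 19.3793),
    (4, "H1"): (4096, 19.1270),
    (4, "H0"): (8192, 19.1270),
}

def window_and_threshold(ebn0_db, status):
    """Return (N, T) for the selected setting, with T = round(T' * N)."""
    n, t_norm = PARA_SETTINGS[(ebn0_db, status)]
    return n, round(n * t_norm)

def next_status(c, threshold):
    """Receiver status after a window: H0 if C exceeds T, otherwise H1."""
    return "H0" if c > threshold else "H1"

# While searching (H1) the short window is used; once aligned (H0) it is doubled
n_search, t_search = window_and_threshold(5, "H1")
n_track, t_track = window_and_threshold(5, "H0")
```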

# 3. IMPLEMENTATION

A certain degree of flexibility has been one of the guidelines followed in the whole design: as an example, the ASIC can be operated also in burst mode when using a suitable external unique word detector [7].

### 3.1 ASIC Implementation

Different functions have been implemented in the ASIC: the Viterbi decoding, the PARA and a Costas loop phase detector (the block diagram is reported in Fig. 4). The 8-bit in-phase and quadrature components (P, Q) entering the ASIC are rounded to 6 bits after phase rotation; the eight branch metrics are therefore quantized with 7 bits. The four path metrics, corresponding to the four branches entering each state of the trellis, are calculated by adding the current branch metrics to the previously computed state metrics, and the best path metric is then selected as the new state metric. An amplitude scaling is performed to keep the metrics inside the range allowed by the adopted 9-bit natural binary representation: when all the metrics exceed 1/4 of the total range, this quantity is subtracted from each of them with a suitable mapping of the two most significant bits.

The device has been realised in a  $1 \ \mu m$  HCMOS process with an overall complexity of 28386 equivalent gates and can be operated up to the maximum speed of 70 Mbit/s; the power consumption at 44.736 Mbit/s is 1 W. The design itself has followed a rather traditional flow, but VHDL synthesis tools have been used to optimise the speed of critical sections.

Fig. 4 Block Diagram of the Decoder ASIC (Top Level)

#### 3.1.1 Memory Management

The three-register trace-back method has been used to manage the decoding memory, using the choice of the best state metric and a decoding depth of 20 symbols. Three 2-port  $16\times10$  RAMs have been integrated to store, for each state of the trellis, the information for trace-back, and one 2-port  $2\times10$  RAM to record the bits decoded during the decision phase (see Fig. 5). For each state of the trellis, the path covered during 20 decoding steps is first stored into the memory blocks and then the state with the best metric is selected; the trace-back and decoding start from this last state.

Fig. 5 RAM Organization for the Three Register Trace-Back Method

Three different operations are performed by each of the three memory blocks: writing, trace-back and decoding. In the writing phase, 2 bits of information per state are stored at every decoding step, while during trace-back and decoding the information is read from the memory and the path is covered backwards; in addition, during decoding the decisions are taken on the decoded bits. The three memory blocks are used as circular buffers in which writing and reading are made alternately in the two opposite directions, so that the three RAMs are at the same time in three different conditions. The memory works according to the following guidelines:

- reading occurs one clock period before writing, so decoding and writing are possible on the same memory;
- all the RAMs have the same reading and writing addresses, provided by a decimal up-down counter;
- decoding must occur within a symbol period; since during this phase the decoded bits are provided in reverse order, they need to be stored in a new RAM so that they can be read in the correct order.

This kind of architecture leads to a total processing delay of 30 symbol periods.
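The trace-back principle can be illustrated on a single buffer. The sketch below does not model the three rotating RAM blocks, and it uses a hypothetical 8-state shift-register trellis (not the trellis of the implemented code) chosen only so that the example is self-contained: each state stores a 2-bit decision that selects one of its four predecessors.

```python
DEPTH = 20  # decoding depth in symbols, as in the text

def next_state(state, u):
    """Hypothetical trellis: shift the 2-bit input u into a 3-bit state."""
    return ((state << 2) | u) & 7

def predecessor(state, decision):
    """Inverse of next_state: the stored 2-bit decision picks the predecessor."""
    return (decision << 1) | (state >> 2)

def trace_back(decisions, best_state):
    """decisions[k][s] is the decision stored for state s at step k.
    Walk the survivor backwards from best_state; the inputs are found in
    reverse order and are reversed before being returned."""
    state, reversed_bits = best_state, []
    for step in reversed(decisions):
        reversed_bits.append(state & 3)      # input that led into this state
        state = predecessor(state, step[state])
    return reversed_bits[::-1]

# Build the decisions left by one known path, then recover its inputs
inputs = [(3 * k + 1) % 4 for k in range(DEPTH)]
decisions, state = [], 0
for u in inputs:
    new = next_state(state, u)
    step = [0] * 8
    step[new] = state >> 1                   # 2 bits stored for the survivor
    decisions.append(step)
    state = new
decoded = trace_back(decisions, state)
```

The reversal at the end of `trace_back` plays the role of the additional RAM mentioned in the last guideline, which restores the correct order of the decoded bits.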

#### 3.1.2 Ambiguity Resolution

The PARA has been implemented by comparing separately (according to the value of the least significant bit of the most likely state) the four even branch metrics or the four odd ones to select the best metric emerging from the most likely state, and then by adding this value to the running sum. A programmable counter times the procedure and selects the window length (4096 or 8192 symbol periods). Two externally programmable threshold values have been implemented: 79424 and 78400 (for a window of 4096 symbols). A status bit is made available externally to be used as an alarm flag for the misalignment condition. This bit can also be used to mark the output bits as erasures, so that an external decoder can take advantage of this information when the device is used in a concatenated code environment.
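The selection of the best metric by state parity can be sketched as below. The mapping between the least significant bit of the state and the even or odd signal set is an assumption made for illustration (the text does not say which parity goes with which set), and the function name and metric values are hypothetical.

```python
def best_branch_metric(branch_metrics, best_state):
    """branch_metrics: the eight metrics, index i for the signal of phase i*pi/4.

    Assumption: a state with LSB 0 emits the even-indexed signals and a state
    with LSB 1 the odd-indexed ones, so only four metrics are compared.
    """
    parity = best_state & 1
    return max(branch_metrics[parity::2])

# Accumulate the selected metric over a (very short) illustrative window
window = [
    ([10, 31, 12, 5, 7, 29, 3, 8], 2),    # (eight branch metrics, best state)
    ([10, 31, 12, 5, 7, 29, 3, 8], 5),
]
running_sum = 0
for metrics, state in window:
    running_sum += best_branch_metric(metrics, state)
```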

### 3.2 Demodulator Implementation

The 44.736 Mbit/s demodulator developed for this first application of the ASIC is a rather conventional one, being basically a modification of the 147.456 Mbit/s QPSK modem designed for ITALSAT (the Italian regenerative domestic satellite) ground stations; the block diagram is reported in Fig. 6. The first block is a 140 MHz IF subunit including image rejection filtering, level range setting and AGC amplifiers: an input dynamic range of 15 dB is guaranteed. The following module performs the actual demodulator functions and includes the quadrature demodulator and the baseband shaping filters; a dual flash 8-bit ADC (type AD9558) is used to feed the input of the ASIC. An item which can affect the behaviour of the PARA is the AGC loop: an inexpensive but stable digital amplitude detector has proved to be effective for that purpose.



Fig. 6 Complete Demodulator Block Diagram

#### 3.2.1 Filters

The square root of a 50% roll-off factor raised cosine has been approximated for each one of the baseband components by a 6<sup>th</sup> order filter followed by one all-pass cell performing group delay equalization. Wide band current feedback operational amplifiers (type AD9718) are used both for isolation and gain purposes inside the filter structure. The accuracy of the shaping filter from DC to Nyquist frequency is within 0.5 dB and  $\pm 2$  ns for amplitude and group delay respectively. A better overall accuracy can be obtained by inserting a suitable SAW device, acting as image rejection filter too, in the IF section: this improvement will be applied to the next issue of the demodulator.
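For reference, the target amplitude response can be written in the standard square-root raised-cosine form. The symbol period  $T_s$  is a symbol introduced here for illustration, and  $\alpha = 0.5$  is the roll-off factor quoted above; the response is normalized to unit gain in the passband:

$$|H(f)| = \begin{cases} 1, & |f| \le \dfrac{1-\alpha}{2T_s} \\ \cos\left[\dfrac{\pi T_s}{2\alpha}\left(|f| - \dfrac{1-\alpha}{2T_s}\right)\right], & \dfrac{1-\alpha}{2T_s} < |f| \le \dfrac{1+\alpha}{2T_s} \\ 0, & \text{otherwise} \end{cases}$$

With  $\alpha = 0.5$  the response is flat up to  $0.25/T_s$ , equals  $\cos(\pi/4) \approx 0.707$  (about  $-3$  dB) at the Nyquist frequency  $1/(2T_s)$ , and vanishes beyond  $0.75/T_s$ .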

#### 3.2.2 Synchronisation

The carrier is recovered using a classical Costas loop scheme: a digital phase detector has been integrated inside the ASIC, effectively sharing gates with the phase rotator. The detector exhibits an adaptive behaviour, its gain decreasing for lower signal-to-noise ratios (see Table 5); as a consequence, the loop equivalent noise bandwidth changes accordingly. For example, it is  $B_{eq} = 3$  kHz for Eb/No = 8 dB.

| <i>Eb/No</i> [dB] | $G_D/G_L$ |
|-------------------|-----------|
| 2                 | 5         |
| 4                 | 75        |
| 5                 | 100       |
| 6                 | 170       |
| 7                 | 250       |
| 8                 | 300       |

Table 5: Ratio between the Gains of Digital and Analog Phase Detectors for the Costas Loop.

The clock synchronisation is performed using a simple arrangement of delay lines and exclusive OR gates: the obtained spectral line is cleaned up by a standard PLL used to lock a tunable crystal oscillator (VCXO).

### 3.3. Measurements

The obtained experimental results for the main demodulator parameters are reported in the following paragraphs.

#### 3.3.1. Bit Error Rate

The BER performance of the equipment is reported in Fig. 7 together with the theoretical behaviour of an infinitely soft equivalent decoder. The curve for an ideal DQPSK system is also shown for comparison: it is important to note that the differential decoding *must* be included to obtain a representative system as far as the synchronisation behaviour is concerned. Taking into account typical implementation losses, the improvement over an actual DQPSK system is in excess of 3 dB, which makes small-aperture dishes suitable for the user terminal.



Fig. 7 BER Performance of the Complete System

#### 3.3.2. Carrier and Clock Cycle Skips

An often neglected aspect of coded modulation schemes is represented by the more stringent requirements affecting the design of the synchronisation circuits. The carrier and clock recovery loops must adopt a very narrow bandwidth to work properly down to the low signal-to-noise ratios allowed by the code gain. The leading parameter to correctly evaluate this kind of performance is the carrier cycle skip rate: Fig. 8 displays this important parameter as a function of Eb/No. The shadowed area represents the specification given for the 120.832 Mbit/s QPSK INTELSAT system, also adopted for the 147.456 Mbit/s QPSK ITALSAT up-link: this requirement is given only as a reference, since no specification is presently available for the described system.



Fig. 8 Cycle Skip Performance of the Demodulator Carrier Recovery Circuit

# 4. CONCLUSIONS

A receiver for the optimum 8-state COPSK TCM scheme has been designed and implemented, in the framework of digital HDTV applications. It includes a custom designed VLSI integrated circuit performing the maximum likelihood decoding and the phase ambiguity resolution. The phase ambiguity resolution algorithm, designed for this application, has proved to be effective in bringing the theoretical advantages over conventional modulation schemes down to the practical implementation without affecting the system performance. The system is also suitable for applications involving an external block code (e.g. a Reed-Solomon one) in a concatenated environment.

# REFERENCES

- 1. S. Benedetto, C. Guerra, M. Mondin, A. Pincetti, F. Pasello, "*Receiver Design for 8-PSK Trellis-Coded Modulation in a TDMA Burst Mode Satellite Link*", Proceedings of 1991 Tirrenia International Workshop, September 1991.
- 2. U. Mengali, A. Sandri, A. Spalvieri, "*Phase Ambiguity Resolution in Trellis Coded Modulations*", IEEE Transactions on Communications, Vol.38, No.12, December 1990.
- 3. A. P. Clark, S. W. Cheung, "*Performance of a Satellite Modem Transmitting Convolutionally and Differentially Encoded 8PSK Signals*", International Journal of Satellite Communications, January-February 1993.
- 4. G. Lorden, R. J. McEliece, L. Swanson, "*Node Synchronization for the Viterbi Decoder*", IEEE Transactions on Communications, Vol.32, 1984.
- 5. S. Benedetto, M. Mondin, "*Advanced Modulation Technique Design. Development and Test of a COPSK Modem Equipment - Final Report on WP 110 bis*", March 1993.
- 6. S. Benedetto, M. Mondin, G. Montorsi, M. Basta, "*Advanced Modulation Technique Design. Development and Test of a COPSK Modem Equipment - Final Report on WP 110*", June 1991.
- 7. R. De Vizia, D. Valencic, "*Demodulatore 8-PSK TCM Burst Mode*", CEFRIEL Internal Report RI 92055, July 1992.