A 1.4×FO4 self-clocked asynchronous serial link in 0.18 µm for intrachip communication

Y. Zhang,⁎, R. Dobkin,⁎, A. Unikovski, D. Nahmanny, G. Samuel, M. Moyal, R. Ginosar

⁎ Corresponding authors.
E-mail addresses: yongxin@alumni.technion.ac.il (Y. Zhang), ran@ee.technion.ac.il (R. Ginosar), reuven@vsyncc.com (R. Dobkin), aharonu@towersemi.com (A. Unikovski), danniel.nahmanny@intel.com (D. Nahmanny), goel@ee.technion.ac.il (G. Samuel), efergan@yahoo.com (M. Moyal).

Y. Zhang a,⁎, R. Dobkin c,⁎, A. Unikovski b, D. Nahmanny d, G. Samuel e, M. Moyal e, R. Ginosar a,⁎

a Department of Electrical Engineering, Technion–Israel Institute of Technology, Haifa 3200003, Israel
b Tower Semiconductor Migdal Haemek 2310302, Israel
c vsync Circuits, Yokneam Ilit 2099021, Israel
d Intel, Haifa 31015, Israel
e Toga Networks, Hod Hasharon 4524075, Israel

ARTICLE INFO

Keywords:
On-chip interconnect
Serial links
Current-mode receiver
Asynchronous circuits

ABSTRACT

In this paper, we describe a repeater-free asynchronous serial link architecture targeting 1×FO4 bit time for on-chip communication. Non-Return To Zero (NRZ) Data/Strobe code is used in the channel to achieve the target speed. Timing pulse trains are generated locally and are employed to drive high speed 'transition latches' in the serializer and deserializer. A RLC line model is derived by the HFSS electromagnetic solver. Inverter-based transmitters and receivers are found to perform faster than other circuits. A prototype device having 30 links and fabricated in Tower Semiconductor 0.18 µm CMOS process is described. Measurement results show 3.73 Gb/s data rate over 6.1 mm wire interconnect, corresponding to 1.44×FO4 bit time.

1. Introduction

With the increase of CMOS integrated circuit die size and decreasing feature size, the performance bottleneck of high speed VLSI circuits has shifted from logic gates to wires. An asynchronous high-speed wave-pipelined bit-serial link that achieves a bit time of a single gate delay (FO4) for on-chip communication is proposed in [1]. However, the current mode transmitter/receiver pair described in [1] does not achieve the desired data rate (67 Gb/s in 65 nm CMOS). This paper follows the bit-serial link architecture of [1] while improving the transmitter, receiver and delay element circuits. The circuits have been fabricated and their performance measured. Possible applications of on-chip high-speed bit-serial links include point-to-point interconnect and router-to-router high speed channels in NoC.

The link architecture [1] is shown in Fig. 1. A digital controller stores 16-bit input data into the two shift registers (SR) [2] and triggers a Pulse Train Generator, which generates two eight-transition pulse trains T0, T90 with 90° phase difference. The two high speed SRs, based on 'transition latches' (XL), are driven by T0, T90 to serialize the data. The Data/Strobe encoder interleaves the serial even and odd bits from the SRs to form the Data/Strobe bits at double the data rate of the SRs. Analog transmitter/receiver pairs interface the link wire. This paper describes several alternative transmit and receive circuits, including current-mode and inverter-based circuits. In addition, a new wire topology is employed, as well as a modified RLC line model. At the receiver side, the Data/Strobe decoder recovers the T0, T90 pulse trains and the data stream, splitting it into the two SRs. Once the 16 bits have been shifted into the SRs, completion is indicated and the digital controller reads the data and counts bit and frame errors.

A 4.5 × 4.5 mm test chip with 30 different links has been fabricated in Tower Semiconductor 0.18 µm CMOS (Fig. 17). All links share the same structure shown in Fig. 1. Each link occupies area around 0.28 mm². The die-photo marks the transmitter and receiver areas, the digital controller and a ring oscillator for parametric and performance measurements.

The rest of the paper is organized as follows: The next section reviews related work. Section 3 introduces the RLC line model employed in this research. Section 4 details the components of the on-chip serial link, describes alternative analog transmit and receive circuits and presents the test environment. Section 5 shows the measured results and analyzes and compares the performance of various serial links. Finally, the paper is concluded in Section 6.

2. Related work

A reliable transmission line model is essential for the design of high speed on-chip interconnect. An RC model for such interconnect is presented and simulated in [3], matching Elmore's delay [4]. However, the RC model is valid only when line resistance dominates, $R \gg \omega L$. The estimated inductive reactance of the wires in our experiment, at frequencies above 5 GHz and based on [5], is larger than the wire resistance. A line model derived from wire dimensions by means of the electromagnetic solver in the Agilent ADS simulator is reported in [6]. A field solver was employed in [7] to compute S-parameters for the transmission line, matching their RLC model for the line. While PTM [5] can extract RLC parameters from line geometry, the resulting characteristic impedance does not match the result obtained from the electromagnetic solver in HFSS [8]. Hence, we use the result from HFSS to derive relevant RLC parameters in this paper while employing balanced differential signaling topology as in [1].

Several methods have been proposed for high-speed, long range, area and energy efficient bit-serial links for on-chip communication in large Systems-on-Chip. The Wave-front train serialization scheme [9] achieves 1.6 Gb/s in measurements, but the receiver circuit cannot be scaled to higher data rates. By modulating transmitter energy to higher frequencies, the pulsed current method [7] achieves 8.0 Gb/s in measurements, corresponding to 2×FO4 bit time in a 0.18 μm process. A wave-pipelined interconnect for networks-on-chip [10] is simulated at 5.45 Gb/s over 10 mm link length, using an interleaved voltage-mode driver, sampler and resistively terminated transmission line. A 2.5D silicon interposer I/O is designed in [13], achieving 24 Gb/s data rate with 12 channels. The power consumption is 7.5 pJ/bit in a GF 65 nm process over a 3 mm T-line.

A transceiver for global on-chip communication [14] consists of a nonlinear charge-injection transmit filter and a sampling receiver with transimpedance pre-amplifier. It achieves measured 4 Gb/s over 10 mm interconnect in a 90 nm CMOS process. A capacitive driven pulse-mode wire using a transmit-side adaptive FIR filter and a clockless receiver is shown in [15]. 4.9 Gb/s is measured over 5 mm interconnect in a 90 nm CMOS. A serializer and transceiver [16] serializes the data with digitally tuned phase interpolator, achieving 9 Gb/s data rate over 5.8 mm lossy interconnect in a 0.13 μm CMOS. The line driver consists of two time-multiplexed inverters, each operating at one half the line frequency.

Current mode transmitter and receiver circuits are often employed in high speed on-chip signaling. A current mode circuit in [3] is designed to achieve higher data rate and better power efficiency. A Modified Clamped Bit-Line Sense Amplifier (MCBLSA) receiver circuit [17] is also more power efficient than optimal repeaters. A Robust Multi-Level Current-Mode On-Chip Interconnect Signaling circuit [18] targets shorter delay than optimally inserted repeaters, but higher power is consumed. The Differential Leakage-Aware Sense Amplifier (DLASA) [19] reduces leakage and static power in advanced process (70 nm and beyond) without affecting performance. An energy-efficient multi-bit quaternary current mode signaling circuit [20] shows measured 2.3 Gb/s data rate in a 0.13 μm process. The Regulated-Cascode Trans-Impedance Amplifier (RGC TIA) used as the front-end preamplifier in optical receivers [21] achieves 1.25 Gb/s in 0.6 μm process. A power efficient capacitive pre-emphasis transmitter and decision feed-back equalization (DFE) receiver that can achieve 2 Gb/s in 90 nm CMOS over 10 mm on-chip interconnects is described in [22]. It requires synchronized transmit and receive clocks. Circuits for capacitively driving long on-chip wires are shown in [23]. 1 GHz performance is measured over 8 mm interconnect in a 0.18 μm process with low energy dissipation.

In this work we employ the RLC line model based on the line's high speed electromagnetic behavior rather than the PTM model. The asynchronous serial link [1] is chosen for its one gate delay bit time and we redesign some internal circuits, such as the clock generator circuit and the transceivers. We investigate the RGC-TIA based current mode receiver for its high-speed and asynchronous property and adopt the Regulated Current Mode Receiver configuration in the system. An inverter based transmitter/receiver pair is also implemented for its simplicity, asynchronous property and most importantly for comparison. A test chip was fabricated in 0.18 μm technology to verify the concept of this serial link.

In summary, issues related to high speed on-chip serial links may include throughput, latency, power, signal integrity, area, encoding, clocking and synchronization. This work focuses exclusively on maximizing data rate, trading off other properties. The goal of the investigation on serial links is to find a replacement for long parallel on-chip links, which require large area, incur high congestion and consume a lot of power. A fast serial link mitigates these problems while providing the same data rate.

3. Transmission line model

One of the critical components of the serial link presented in this paper is the serial interconnect. Correct modeling of the interconnect is essential to optimize the link for a high performance operation.

3.1. Model consideration

An RLC line model is explored for our on-chip serial link. Since the link is asynchronous, line latency is not a critical factor. Instead, attention is paid to attenuation and characteristic impedance $Z_0$. We utilize the HFSS electromagnetic solver [8] to obtain $Z_0$ and derive $R$, $L$ and $C$ values accordingly.

A preliminary estimate of $L$ and inductive reactance is made using PTM [5]. With 5 μm line width, 5 μm spacing, 0.94 μm thickness and 0.82 μm distance from ground, inductance $L \approx 1.72 \text{ nH/mm}$. At 5 GHz, $\omega L \approx 54 \Omega/\text{mm}$, much larger than the resistance $R \approx 7.02 \Omega/\text{mm}$. This relation justifies using RLC rather than an RC line model [3].
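The comparison between inductive reactance and resistance can be checked numerically. The short Python sketch below uses only the per-millimetre values quoted above; the function and variable names are illustrative.

```python
import math

# Per-unit-length values quoted in the text (PTM-based estimate)
L_PER_MM = 1.72e-9   # inductance, H/mm
R_PER_MM = 7.02      # resistance, ohm/mm
FREQ_HZ = 5e9        # frequency at which the comparison is made


def inductive_reactance(l_per_mm: float, freq_hz: float) -> float:
    """Return omega*L in ohm/mm."""
    return 2 * math.pi * freq_hz * l_per_mm


x_l = inductive_reactance(L_PER_MM, FREQ_HZ)
ratio = x_l / R_PER_MM
print(f"omega*L = {x_l:.1f} ohm/mm, R = {R_PER_MM} ohm/mm, ratio = {ratio:.1f}")
```

With these inputs the reactance is several times the resistance, which is the condition under which an RC-only model stops being adequate.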

To assess signal attenuation, we focus on estimating the $Z_0$ of the line following [24]. Signal amplitude at the end of X-units long transmission line is proportional to $e^{-\frac{\alpha X}{2}}$ [25]. We use the conventional RLC model. RLC parameters computed by PTM [5] lead to $Z_0 \approx 73.4 \Omega$. HFSS, however, estimates $Z_0 \approx 40 \Omega$. Since attenuation is critical in our design, we prefer the lower estimate from HFSS, resulting in higher estimate of attenuation.
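To illustrate why the lower $Z_0$ estimate is the more pessimistic one, the sketch below evaluates the amplitude factor $e^{-\alpha X/2}$ for both impedance estimates. It assumes the standard low-loss approximation with $\alpha = R/Z_0$ (our reading of the notation, not stated explicitly in the text), and borrows $R = 6.02\ \Omega/\text{mm}$ and the 6.1 mm link length from elsewhere in the paper.

```python
import math


def amplitude_factor(r_per_mm: float, z0: float, length_mm: float) -> float:
    """Low-loss RLC line: amplitude ~ exp(-alpha*X/2), with alpha = R/Z0 (assumed)."""
    alpha = r_per_mm / z0
    return math.exp(-alpha * length_mm / 2)


R_PER_MM = 6.02    # ohm/mm, value reported for the chosen line geometry
LENGTH_MM = 6.1    # longest link on the test chip

ptm = amplitude_factor(R_PER_MM, 73.4, LENGTH_MM)    # Z0 estimated from PTM
hfss = amplitude_factor(R_PER_MM, 40.0, LENGTH_MM)   # Z0 estimated by HFSS
print(f"PTM  Z0=73.4 ohm: amplitude factor {ptm:.3f}")
print(f"HFSS Z0=40.0 ohm: amplitude factor {hfss:.3f}")
```

Under these assumptions the HFSS impedance predicts a noticeably smaller received amplitude, so designing against it leaves more margin.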

3.2. Line topology

The transmission line topology, shown in Fig. 2, is a balanced differential signaling topology for current mode operation [1]. The thickness of the transmission line is determined by the selected process. To alleviate signal attenuation, wide and thick transmission lines are preferred. Longer and narrower lines lead to signal distortion, which limits the data rate. In the selected process, the largest line thickness (top metal layer) is 0.94 μm. Skin effect is negligible at the target frequencies: the signal rise and fall times are about 40 ps, resulting in an effective frequency of 25 GHz and a skin depth of δ = 0.5 μm, so that the line thickness is less than 2δ. Since the skin effect is negligible, so is the proximity effect [26]. Line width and spacing were determined using simulations, optimizing for maximal length while preserving 1×FO4 bit time [7]. Line distance to ground is 0.82 μm based on the process geometry. Line spacing is sufficiently large to render coupling capacitance negligible. The obtained parameters are R = 6.02 Ω/mm, L = 0.14 nH/mm and C = 201.5 fF/mm at 5 GHz with the chosen line geometry.
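The skin-depth argument can be reproduced with the classical formula $\delta = \sqrt{\rho / (\pi f \mu_0)}$. The sketch below assumes bulk aluminum resistivity (about $2.65 \times 10^{-8}\ \Omega\,\text{m}$, an assumption; the actual top-metal resistivity of the process may differ) and takes the effective frequency as the reciprocal of the 40 ps edge time, as in the text.

```python
import math

MU_0 = 4 * math.pi * 1e-7   # vacuum permeability, H/m
RHO_AL = 2.65e-8            # assumed bulk aluminum resistivity, ohm*m


def skin_depth_m(freq_hz: float, resistivity: float = RHO_AL) -> float:
    """Classical skin depth of a good conductor, in metres."""
    return math.sqrt(resistivity / (math.pi * freq_hz * MU_0))


rise_time_s = 40e-12
f_eff_hz = 1 / rise_time_s                 # about 25 GHz
delta_um = skin_depth_m(f_eff_hz) * 1e6
thickness_um = 0.94                        # top metal thickness from the text

print(f"f_eff = {f_eff_hz / 1e9:.0f} GHz, skin depth = {delta_um:.2f} um")
print("thickness < 2*delta:", thickness_um < 2 * delta_um)
```

The result is close to the 0.5 μm quoted above, and the 0.94 μm line remains thinner than two skin depths.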

4. On chip serial link

In this section, the detailed circuits of Fig. 1 are described.

4.1. Pulse train generator

Fig. 3 shows the pulse train generator and Fig. 4 shows the waveforms. Pulses are generated rather than provided externally for two reasons. First, a very high pulse rate is needed, higher than can be delivered from an external source. Second, the link is asynchronous, not requiring precise clocking. The Pulse Train Generator diagram follows [27]. First, a start signal (a step function) is driven into a delay line. Each delay element introduces a basic delay \( D_1 \). Ideally, \( D_1 = \text{FO4} \), the delay of a single logic gate driving four identical gates. The entire delay line generates 16 \( D_1 \)-delayed versions of the input step, C1-C16 in Fig. 4. XORing C1-C16 as in Fig. 3 produces T0 and T90, having four pulse cycles (eight transitions) and separated by 90°. Each cycle is \( 4 \times D_1 \). The last XOR gate generates T at twice the rate of T0 and T90. The duty cycle of the generated signal is affected by the mismatch between delay elements. This is partially mitigated by delay element sizing. Moreover, the circuit is asynchronous and thus insensitive to duty cycle variations, as long as minimum timing is met.
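A behavioural sketch of the pulse train generator may help. Time is counted in units of \( D_1 \), and each tap Ck is an ideal step delayed by k units. The assignment of odd taps to T0 and even taps to T90 is an assumption made for illustration (the XOR tree of Fig. 3 is not reproduced here); it is chosen because it yields eight transitions per train, a \( 4 \times D_1 \) cycle and a 90° offset, as described above.

```python
from functools import reduce
from operator import xor

N_TAPS = 16  # C1..C16


def tap(k: int, t: int) -> int:
    """Input step delayed by k basic delays: 0 before time k, 1 from k onward."""
    return 1 if t >= k else 0


def xor_taps(taps, t: int) -> int:
    """XOR of the selected delay-line taps at time t."""
    return reduce(xor, (tap(k, t) for k in taps), 0)


def transitions(wave) -> int:
    """Count level changes in a sampled waveform."""
    return sum(a != b for a, b in zip(wave, wave[1:]))


times = range(0, N_TAPS + 2)                                  # 0..17, units of D1
t0 = [xor_taps(range(1, N_TAPS + 1, 2), t) for t in times]    # odd taps (assumed)
t90 = [xor_taps(range(2, N_TAPS + 1, 2), t) for t in times]   # even taps (assumed)
t_fast = [a ^ b for a, b in zip(t0, t90)]                     # last XOR gate: T

print("T0 :", t0)
print("T90:", t90)
print("T  :", t_fast)
print(transitions(t0), transitions(t90), transitions(t_fast))
```

In this model T90 is simply T0 delayed by one \( D_1 \), and T toggles once per \( D_1 \), i.e. once per transmitted bit.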

The delay element schematic is shown in Fig. 5. The delay is controllable, to enable testing at various data rates. A current starvation inverter circuit is used in the delay element with controllable bias voltage (Fig. 6) which regulates the charging and discharging current of the current starvation transistors, providing a wide range of delays.

The variable capacitive load on the OUT node of the delay element is controlled by digital signals D0-D3. Long channel transmission gates are employed in the capacitive load transistors. While in low frequency applications the load might consist of transistor gates, at high frequency the effective load is provided merely by the diffusion connected to the OUT node, and long channels enhance that switchable capacitance. The ineffective gate loads are eliminated and the other sides of the transmission gates are left unconnected.

Fig. 6 shows the bias voltage controller for the current starvation inverter of Fig. 5. By opening and closing the PMOS transistors, different MN/MP voltage values are generated. The first transistor is a minimum width PMOS. The combination of \( x1-x64 \) enables \( 2^7 = 128 \) different (MN,MP) voltage levels.

4.2. Shift register

The Shift Register (SR) in Fig. 1 consists of 16 transition latches (XL) (Fig. 7). X, Xnot (which are either T0/T0N or T90/T90N) provide differential transitions that are spaced $2 \times D_1$ delays apart (Fig. 4). They control two parallel latches. Upon a transition, one latch becomes transparent and the other latch turns opaque. Setup and hold times for the latches are assured by relative delay analysis (the data path is longer than the X, Xnot paths) and multiple XLs may be chained without timing violations. X, Xnot propagate over multiple XL stages faster than the propagation of data, and data moves from one XL stage to the next one before the following transition of X, Xnot arrives. As the fastest link on this test chip achieves $1.36 \times \text{FO4}$ bit time, the shift register is successfully demonstrated at $2.72 \times \text{FO4}$ data rate. XL is an improved version of the design in [2].

4.3. Data/Strobe encoder

Data/Strobe (Level-Encoded Dual Rail – LEDR) code is a systematic code (the encoded bit stream contains the original data), simplifying decoding [28]. An encoding example is shown in Fig. 8. XORing the Data and Strobe bits at the receiver produces clock pulses at the same data rate as the data, enabling self-timed data recovery.

The Data/Strobe encoder circuit (Fig. 9) interleaves odd and even bits to achieve the Data/Strobe protocol of Fig. 8. The dual rail pass-transistor-logic-based high speed XOR [29] generates a sequence of transitions spaced by a single $D_1$ delay (Fig. 4). The transmission gate structure converts the data stream into Data/Strobe transitions. Rigorously symmetric layout of the different modules, such as XL and SRs, is employed to achieve accurate delay matching.

\[
D(i) = B(i), \qquad
S(i) = \begin{cases} \overline{B(i)}, & i \text{ odd} \\ B(i), & i \text{ even} \end{cases}
\]

Fig. 8. Example of Data/Strobe (LEDR) protocol encoding.
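The encoding rule can be sketched in a few lines of Python. The sketch assumes the usual LEDR convention in which the Data wire carries the bit itself (consistent with the code being systematic) and the Strobe wire carries the bit, inverted for odd indices. Bit indices start at 1 and both wires are assumed to idle at 0; the function names are illustrative.

```python
def ledr_encode(bits):
    """Return (data, strobe) wire levels for bits B(1..n)."""
    data, strobe = [], []
    for i, b in enumerate(bits, start=1):
        data.append(b)
        strobe.append(b ^ 1 if i % 2 else b)   # odd i: inverted, even i: unchanged
    return data, strobe


def ledr_decode(data, strobe):
    """Payload is read from the Data wire; D xor S gives the recovered phase."""
    phase = [d ^ s for d, s in zip(data, strobe)]
    return list(data), phase


bits = [1, 0, 0, 1, 1, 1, 0, 1]
d, s = ledr_encode(bits)

# Count how many wires change for each transmitted bit, starting from idle (0, 0).
prev = (0, 0)
toggles = []
for cur in zip(d, s):
    toggles.append(int(cur[0] != prev[0]) + int(cur[1] != prev[1]))
    prev = cur

decoded, phase = ledr_decode(d, s)
print("Data  :", d)
print("Strobe:", s)
print("toggles per bit:", toggles)
print("phase (D xor S):", phase)
```

Exactly one wire changes per bit, so the XOR of the two wires alternates on every bit and can serve as the recovered timing signal.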

4.4. Analog link

Following the Data/Strobe encoder, maximum data rate signals are achieved in the circuit and need to be transmitted to the receiver through the analog channel. Due to high capacitance over the long interconnect, fast full-swing transitions result in high dynamic current, dissipate power and may cause crosstalk noise. Current mode signaling may help to reduce the high-voltage swing over the channel and may dissipate less power as well as achieve higher throughput compared with repeaters [3,17–20]. However, these reported circuits are not sufficiently fast for the FO4 data rate requirement [1].

The analog receiver circuit used in [1] employs a Regulated-Cascode (RGC) input stage that is widely employed in RF and optical communication [30]. However, that circuit is not fast enough for $1 \times \text{FO4}$ operation. An enhanced version appending a trans-impedance amplifier (RGC-TIA) is reported in [21]. In this section, we explore a current mode driver, two current-mode receiver circuits (RGC-TIA and a modified version, the Regulated Current Mode Receiver), and an inverter-based circuit.

4.4.1. Adaptively loaded current mode driver

For current mode links we have employed the adaptive control driver following [1]. It is designed to address the frequency-dependent degradation due to characteristic impedance dependence on frequency.

The inverter and the AND gate in Fig. 10 constitute an inertial delay. It controls a variable load on the driver output. When the input is stable, the driver strength is reduced and when the input toggles fast, the AND gate never turns on and the driver strength is increased. With completely symmetric design, the adaptive control is capable of handling slow-to-fast data transients [1].

4.4.2. Regulated current mode receiver

RGC-TIA configuration [31,21] is illustrated in Fig. 11. The low impedance RGC input stage determines the non-dominant pole at the input stage that enables high bandwidth. The feedback resistor $R_f$ guarantees high bandwidth response, and does not affect the DC bias [31]. M2 is used to isolate the large input capacitance from the M3 common source stage, as well as to shift the signal voltage level. M4 and M5 source followers are designed to adjust the DC voltage for the following operations [21].

However, for our application, the output DC voltage is too low to drive the digital circuit (inverter) directly. It is desirable to set the output DC voltage around VDD/2. One approach could be changing the last stage from source follower to a common source; however, two common source stages in succession would limit the bandwidth. The last two source follower stages do not contribute to the bandwidth. A reasonable DC voltage could be achieved by removing these two stages and applying the feedback from the drain of M3 to the drain of M1.

The DC drain voltage of M1 is chosen to be VDD/3 to ensure the transmitter works in saturation. Gate voltage is optimized with different combinations of MB and RB (Fig. 12), realizing that the higher the current, the higher the bit rate. The DC current of C1 is set to 0.1 mA for idle power consideration. R0 and C2 guarantee that M2 works in saturation. The M3 and R3 combination ensures that the output signal is biased around VDD/2. Higher current through the last stage, achieved by lower R3 and wider M3, results in higher voltage swing at the output, closer to full swing.

It turns out, however, that removing the feedback resistor $R_f$ leads to better performance. The feedback limits the operating frequency of the circuit as the loop has to close before it can be considered as a trans-impedance. When $R_f$ goes to infinity, no feedback exists and the circuit operates faster in open loop. The resulting Regulated Current Mode Receiver (RCMR) is shown in Fig. 12. The current sources C1 and C2 are biased by two standalone current mirrors.

Since the driver (Fig. 10) is differential, two receivers of Fig. 12 are needed. The current return path goes through VDD of the first part of the two receiver circuits (rather than through GND nodes, isolated from the current path by the current sources C1). The VDD nodes are physically placed close to each other to facilitate the return path.

The RCMR circuit satisfies the 1×FO4 data rate in simulation. However, the circuit, especially the DC output voltage, is sensitive to variations. In the test chip we had to apply voltage levels higher than nominal VDD (1.8 V), and even then the circuit did not achieve high performance. Another disadvantage of this circuit is constant current consumption, regardless of the data rate.

4.4.3. Inverter based driver and receiver

To counter the expected deficiencies of the current mode circuit described above, a simple inverter-based transmitter and receiver pair is studied in the test chip (Fig. 13).

A large driver enables a high performance link operating at 3.73 Gb/s over 6.1 mm interconnect. The signal wavelength, 51 mm at 2.0 GHz in aluminum, is much longer than the 6.1 mm wire length, and hence the interconnect is modeled as a lumped RLC. Employing inverters, unmatched to the line impedance, may be effective at these relatively low frequencies. At much higher frequencies, when the wavelength is shorter than the wire length, circuits with matched impedance rather than inverters may be required.

The eye diagram for the simulated transmitter and receiver pair is illustrated in Fig. 14. In simulation, the pair operates correctly at a data rate of 5.0 Gb/s.

Other high speed transmitters and receivers are reported in [3,17–20]. Some circuits emphasize energy efficiency over maximum data rate. Future work may investigate such circuits in the context of asynchronous high speed links.

4.5. Data/Strobe decoder

The Decoder circuit is shown in Fig. 15. It splits incoming data into even and odd lanes, and generates transition sequences. Using the recovered Data and Strobe from the analog receiver, the dual rail XOR gate \([29]\) generates \(T\), \(TN\) transitions spaced one \(D_1\) delay apart. The two small back-to-back inverters at the XOR output help to align the complementary transitions. \(T\) and \(TN\) control the first (splitting) pair of data latches and feed into the Toggle circuit \([32]\) (Fig. 15(b)). The fast toggle receives a \(T\), \(TN\) transition every \(D_1\) delay (Fig. 4), divides that data rate by two and recovers the \(T0\), \(T90\) signals for the shift registers (SRs) at the receiver side. These SRs are similar to the SRs of the transmitter, described in Section 4.2. At the end of the pulse train, data is stored at the output of the SRs.
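The decoder behaviour can be modelled at a purely logical level. In the sketch below, T is taken as the XOR of the received Data and Strobe levels, and a toggle alternately advances two half-rate signals standing in for \(T0\) and \(T90\) while steering bits into two lanes. Which lane receives the first bit, and the idle level of T, are assumptions made for illustration; circuit-level timing is not modelled.

```python
def decode_lanes(data, strobe):
    """Split a Data/Strobe stream into two half-rate lanes (behavioural model)."""
    t_prev = 0                 # assumed idle level of T = D xor S
    t0 = t90 = 0               # recovered half-rate pulse trains
    t0_wave, t90_wave = [], []
    odd_lane, even_lane = [], []
    to_odd = True              # the toggle alternates between the two lanes
    for d, s in zip(data, strobe):
        t = d ^ s
        if t != t_prev:        # one T transition per received bit
            if to_odd:
                t0 ^= 1
                odd_lane.append(d)
            else:
                t90 ^= 1
                even_lane.append(d)
            to_odd = not to_odd
            t_prev = t
        t0_wave.append(t0)
        t90_wave.append(t90)
    return odd_lane, even_lane, t0_wave, t90_wave


# Build a Data/Strobe stream: Data = bit, Strobe = bit inverted on odd indices.
bits = [1, 0, 0, 1, 1, 1, 0, 1]
data = list(bits)
strobe = [b ^ (i % 2) for i, b in enumerate(bits, start=1)]

odd_lane, even_lane, t0_wave, t90_wave = decode_lanes(data, strobe)
print("odd lane :", odd_lane)
print("even lane:", even_lane)
print("T0 :", t0_wave)
print("T90:", t90_wave)
```

Each recovered train toggles once for every two received bits, which is the divide-by-two behaviour attributed to the toggle circuit above.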

4.6. Testing environment

Fig. 16 shows the PCB used to test the serial link. Note that only one device is assembled on the board in the figure. However, up to five ICs may be tested simultaneously. A LabView GUI is designed to control the PCB board through a NI sbRIO9642 FPGA test interface board. All pads on the test chip are slow I/O, supporting only up to 100 MHz. The internal link signal cannot be exported outside the chip, and eye diagrams for internal fast signals are not available. A ring oscillator and a frequency divider can generate a slow output clock signal that enables measuring and estimating actual gate delays and working speed of the internal links.

5. Measurement results

The die photo of the 30-link test chip is shown in Fig. 17 with 30 Transmitter blocks on the left and 30 Receiver blocks on the right. The digital controller is placed in between. The die has been fabricated in Tower Semiconductor 0.18 μm CMOS process. The measured 41 inverter stage ring oscillator frequency (recorded externally after frequency division by 1024 in ten stages) is \(F_{\text{ext}} = 167\) kHz at 1.8 V supply voltage and room temperature, implying a ring oscillator frequency of \(F_{\text{RO}} = 171\) MHz. This measured result can be compared to the simulated \(F_{\text{ext-sim}} = 235\) kHz. The same ratio, \(F_{\text{ext}} / F_{\text{ext-sim}}\), was employed for estimating link data rates relative to their simulation counterparts. Similarly, it is used for estimating FO4 bit time (Table 1), as follows.

\[
F_{\text{link-test}} = F_{\text{link-sim}} \times \frac{F_{\text{ext}}}{F_{\text{ext-sim}}}
\]
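The arithmetic behind this calibration is sketched below, under one reading of the procedure: the on-chip ring oscillator frequency is the external frequency multiplied by the divider ratio, and a simulated link rate is scaled to silicon by the ratio of measured to simulated external frequency. The helper name `scale_to_silicon` is hypothetical.

```python
DIVIDE_RATIO = 2 ** 10      # ten divide-by-two stages
F_EXT_MEAS_HZ = 167e3       # measured external frequency
F_EXT_SIM_HZ = 235e3        # simulated external frequency

f_ro_hz = F_EXT_MEAS_HZ * DIVIDE_RATIO        # on-chip ring oscillator frequency
speed_ratio = F_EXT_MEAS_HZ / F_EXT_SIM_HZ    # silicon speed relative to simulation


def scale_to_silicon(f_link_sim_hz: float) -> float:
    """Estimate the silicon link rate from a simulated rate (assumed linear scaling)."""
    return f_link_sim_hz * speed_ratio


print(f"F_RO = {f_ro_hz / 1e6:.1f} MHz")
print(f"silicon/simulation speed ratio = {speed_ratio:.3f}")
```

The ratio of roughly 0.71 indicates that the fabricated circuits run about 29% slower than their simulated counterparts.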

The chip is tested as follows. The digital controller in the chip communicates with the LabView GUI. Once control signals are set, the data is loaded into the shift registers of one of the 30 links. After loading, the start signal is sent to the Pulse Train Generator, triggering link operation. If the link has performed correctly, a latch enable (ACK) signal is valid at the output of the receiver SR, notifying the digital controller that the data has been received and can be read out. The digital controller compares the received and sent words for performance evaluation, computing Bit Error Rate (BER), Frame Error Rate (FER) and other results. If, however, the ACK signal is not available within a pre-determined time, BER is set to the word length (16) and FER = NTO = 1 (NTO is Number of Time Out events). The digital controller can also perform a long sequence of word transmits and receives.

Table 1
Inverter based link over 6.1 mm at different supply voltages.

<table>
<thead>
<tr>
<th>Supply Voltage</th>
<th>1.63 V</th>
<th>1.80 V</th>
<th>2.00 V</th>
</tr>
</thead>
<tbody>
<tr>
<td>FO4 bit time</td>
<td>199 ps</td>
<td>186 ps</td>
<td>162 ps</td>
</tr>
<tr>
<td>Data Rate (6.1 mm)</td>
<td>3.65 Gb/s</td>
<td>3.73 Gb/s</td>
<td>4.1 Gb/s</td>
</tr>
<tr>
<td>Bit time (×FO4)</td>
<td>1.38×FO4</td>
<td>1.44×FO4</td>
<td>1.51×FO4</td>
</tr>
<tr>
<td>Link Energy</td>
<td>62.9 pJ/bit</td>
<td>78.1 pJ/bit</td>
<td>100.5 pJ/bit</td>
</tr>
</tbody>
</table>

Fig. 18 shows bit error rate (BER) for two links, one using the Regulated Current Mode Receiver (RCMR) extending over 6.2 mm and the other using inverters as transmitters and receivers over a 6.1 mm long wire. BER charts are shown as functions of data rate and supply voltage. Each link operates properly (no errors are detected) at data rates up to a certain point where errors start to appear. The error rate climbs steeply when the data rate is increased thereafter. We report the highest rate at which there are still no errors in Tables 1–3. Evidently, the current mode link cannot operate at 1.80 V. It does work at a higher voltage. It is further evident that the inverter-based transmitter/receiver links achieve higher performance than the current mode link at all supply voltage levels. Note that in contrast with typical communications where BER is correlated with channel noise and very small error rates are acceptable, in this work errors result from the inability of the circuits to operate at high frequencies. Hence, rather than a logarithmic description of minute error rates, we plot a simpler BER chart, demonstrating that errors are indeed due to frequency saturation.

In addition to FO4 delay based on supply voltage, Table 1 also indicates the data rate in terms of Gb/s and in FO4 units. As expected, higher voltage results in higher data rate. Link energy is also reported, computed by simulation.

Table 2 indicates the measured effect of link length on data rate, expressed in Gb/s and in FO4 units. There is no significant difference between the two short links (1.7 mm and 2.6 mm). The long link is only slightly slower.

Table 3 compares this work with other published works [9,7,12,13]. We demonstrate, for the first time, an on-chip link with bit time below 2×FO4. Our goal is a 1×FO4 bit time, and the test chip shows 1.44×FO4 with long interconnect. Although the interconnect power efficiency may appear lower than [7,13], we note that the DLL power of the pulsed current method has not been included in [7]. High power is incurred due to optimizing the circuits for high throughput only. Future research could address minimizing power consumption while maintaining high performance.

The estimated FO4 bit time is 186 ps in selected process. If the circuit can be made to achieve 1×FO4 bit time data rate, 5.37 Gb/s data rate will be available. Furthermore, if the inverter-based asynchronous link were implemented in the same 0.18 μm technology as [7], the 60 ps FO4 delay might imply data rate of 16 Gb/s.
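The normalization to FO4 units used throughout this section is a simple ratio of bit time to FO4 delay. The sketch below recomputes it from the supply voltage, FO4 delay and data rate values of Table 1, and evaluates the 1×FO4 rates mentioned above; function names are illustrative.

```python
def bit_time_in_fo4(data_rate_gbps: float, fo4_ps: float) -> float:
    """Bit time expressed in multiples of the FO4 delay."""
    bit_time_ps = 1000.0 / data_rate_gbps
    return bit_time_ps / fo4_ps


def rate_at_one_fo4_gbps(fo4_ps: float) -> float:
    """Data rate in Gb/s if one bit is sent per FO4 delay."""
    return 1000.0 / fo4_ps


# (supply voltage in V, FO4 delay in ps, measured data rate in Gb/s), from Table 1
measurements = [(1.63, 199, 3.65), (1.80, 186, 3.73), (2.00, 162, 4.1)]

for vdd, fo4_ps, rate in measurements:
    print(f"{vdd:.2f} V: {bit_time_in_fo4(rate, fo4_ps):.2f} x FO4")

print(f"1 x FO4 at 186 ps: {rate_at_one_fo4_gbps(186):.2f} Gb/s")
print(f"1 x FO4 at  60 ps: {rate_at_one_fo4_gbps(60):.1f} Gb/s")
```

Note that the absolute data rate rises with supply voltage while the FO4-normalized bit time gets slightly worse, since the gates speed up faster than the link does.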

The test circuits reported in [9], using the same process node and voltage as this work, achieved only 1.6 Gb/s on a shorter link. While [7] uses a 0.18 μm process as well, that process seems to be much faster than the one used in this work, having a three times shorter FO4 delay and resulting in a higher absolute data rate. When normalized to FO4 delays, a bit time of 2.1 gate delays is achieved in [7]. More recently, [12] demonstrated 22 Gb/s over 10 mm in a 65 nm process, verifying that advanced technology may enable higher speed. Neither [9] nor [12] reported bit times in FO4 terms.

6. Conclusions

We have demonstrated that the asynchronous Data/Strobe protocol (almost) achieves the 1×FO4 bit time goal over lossy on-chip transmission lines. A test chip consisting of 30 different links demonstrates 3.73 Gb/s data rate, translating to 1.44×FO4, over a 6.1 mm link in 0.18 μm CMOS. This suggests that fast serial links, built from high speed asynchronous digital circuits, may replace long parallel links, providing comparable throughput while possibly consuming less power.

The presented asynchronous link generates its own pulse train rather than resorting to external clocks. A fast shift register, based on novel ‘transition latches’, is employed for serialization and de-serialization. A fast toggle circuit was developed to de-multiplex transitions. Data/Strobe signaling is employed to enable asynchronous timing. High speed asynchronous digital circuits operate at either 1×FO4 or 2×FO4 bit time.

Two signaling styles over the link wires have been investigated. An adaptive control current mode transmitter, differential wires, and Regulated Current Mode Receiver (RCMR) comprise the current mode link. The other link type uses two inverters as transmitter and two other inverters as receiver. The latter link performed faster in our 0.18 μm CMOS test chip, achieving 1.44×FO4. However, transmit and receive circuits, as well as the digital sections, which can achieve 1×FO4 bit time and operate at low energy should be developed for more advanced technologies.

Acknowledgment

This research was supported in part by a research Grant (number 880011) from Israel Ministry of Science.

Table 2
Inverter based links operating at 1.80 V.

<table>
<thead>
<tr>
<th>Line Length</th>
<th>Data rate (Gb/s)</th>
<th>Bit time (×FO4)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.7 mm</td>
<td>3.89</td>
<td>1.38</td>
</tr>
<tr>
<td>2.6 mm</td>
<td>3.96</td>
<td>1.36</td>
</tr>
<tr>
<td>6.1 mm</td>
<td>3.73</td>
<td>1.44</td>
</tr>
</tbody>
</table>

Table 3
Comparison with other works.

<table>
<thead>
<tr>
<th>Work</th>
<th>Process</th>
<th>Method</th>
<th>Design goal</th>
<th>Data rate</th>
<th>Bit time (FO4)</th>
<th>Supply voltage</th>
<th>Data rate (Gb/s)</th>
<th>Bit time (FO4)</th>
</tr>
</thead>
<tbody>
<tr>
<td>[9]</td>
<td>0.18 μm</td>
<td>Wave-Front</td>
<td>Interleaved</td>
<td>2×FO4</td>
<td>NA</td>
<td>1.8 V</td>
<td>1.8 V</td>
<td>2 (per channel)</td>
</tr>
<tr>
<td>[7]</td>
<td>0.18 μm</td>
<td>Sync Tx/Rx</td>
<td>VM Tx/Rx</td>
<td>65 nm</td>
<td>NA</td>
<td>1.8 V</td>
<td>1.2 V</td>
<td>2.1</td>
</tr>
<tr>
<td>[12]</td>
<td>NA</td>
<td>Interleave</td>
<td>1/O</td>
<td>65 nm</td>
<td>NA</td>
<td>1.8 V</td>
<td>1.2 V</td>
<td>1×FO4</td>
</tr>
<tr>
<td>[13]</td>
<td>NA</td>
<td>Interleave</td>
<td>1/O</td>
<td>65 nm</td>
<td>NA</td>
<td>1.8 V</td>
<td>1.2 V</td>
<td>1×FO4</td>
</tr>
<tr>
<td>This work</td>
<td>0.18 μm</td>
<td>Wave-Front</td>
<td>Interleaved</td>
<td>1×FO4</td>
<td>0.29 (+3.1)</td>
<td>1.8</td>
<td>1.44</td>
<td>1×FO4</td>
</tr>
</tbody>
</table>

Fig. 18. Link BER performance.

References


