- F. K. Buelow
- F. B. Hartman
- E. L. Willette
- J. J. Zasio

# A Circuit Packaging Model for High-Speed Computer Technology

Abstract: An exploratory model has been constructed in a study of packaging and circuit techniques for a high-speed computer technology. An Arithmetic and Logic Unit capable of processing 64-bit words in floating-point format was fully designed. From this design a nucleus system comprising 424 circuits and 1838 transistors was abstracted, built, and tested. In this model, a delay of 2.2 nsec per level of logic was achieved in worst-case paths. This figure includes the wiring and power driver delays.

#### Introduction

This paper describes the model that was used to evaluate and direct circuit and packaging development for a high-speed computer technology. Many circuits have been described which show promise of achieving speeds of 1 nsec per logic decision. These circuits are usually developed with sole emphasis on the speed of the logical switching operations. Most such 1-nsec circuits can perform only a very limited number of logic functions, with poor fan power, and have very restrictive wiring rules. Because of these restrictions, more than two levels of circuitry are usually needed to implement commonly encountered logic, such as a two-input AND driving a three-input OR, which in turn drives three other circuits. Alternatively, these common logic functions can frequently be performed in one level of transistor circuitry<sup>2</sup> with 2 or 3 nsec delay. Quite obviously, the nominal delay of a circuit is not a good measure of potential system speed. Experience shows that the most satisfactory way to evaluate a circuit family is to use a system, preferably a large realistic system, as a test vehicle.

These considerations led to the *simultaneous* design of all aspects (logic, system, circuits, package) of a 20,000-transistor Arithmetic and Logic Unit (ALU). This simultaneous design procedure provided an insight into the interactions between packaging, circuit performance, and system performance. From the complete layout of this ALU, a 10% nucleus was abstracted as a data flow model to be built and tested. This model has been constructed and evaluated. The evaluation suggests that tunnel-diode-coupled circuits<sup>3</sup> will considerably improve system performance.

This paper will outline the packaging techniques and system design of this model and the experimental results obtained.

# Packaging technique

The construction of a high-speed model presented two requirements which were sometimes contradictory. Dense circuit packaging was necessary to minimize transmission-line propagation delays, but circuits had to be constructed from readily available large components to minimize construction delays.

Usually, such circuit components are too large to permit dense packaging. The circuits used here, however, were based on a silicon transistor which, in the usual laboratory environment, need not be encapsulated. This transistor, mounted on a tab 50 mils square, has one gold and two aluminum one-mil diameter leads. The transistor package permitted the construction of small circuit modules.

# • Circuit modules

A 10-pin module was chosen as a basic cell. Some 20-pin modules and one 30-pin module were used for functional combinations of circuits. A 20-pin module is shown in Fig. 1.

The pins on this module are located on a 0.1-inch grid. When indicators are required, 10-ma incandescent bulbs are used directly on the circuit module. Two bulbs are shown atop the module in Fig. 1. Resistors were conventional 1% carbon film resistors, usually IRC Type DM 1/8 w. When mounted as shown, and with the power levels of the circuits, a density of 100 resistors per square inch could be used. A pair of 10-ma germanium tunnel diodes packaged in 1/8" diam pills are mounted in the center of the resistor pack. These diodes are first mounted on a power supply bypass capacitor, as shown in Fig. 2, and then placed in the circuit module. The bottom wafer of the circuit module holds only transistors, as illustrated in Fig. 3.

The wafers are prepared from copper-clad epoxy glass. Double-sided wiring and through-plated holes are used. Lines are 3 mils thick and 20 mils wide. The heavy lines are used as current returns when the transistor leads are resistance welded to the lands. The lands are nickel plated to increase the interface resistance between the gold transistor lead and the copper lands.

This type of circuit module, while crude in appearance, can be adapted to many circuit configurations; it does not degrade circuit performance, and all parts can be readily procured and assembled.

Figure 1 A 20-pin circuit module, Type LD. This module is 0.7" wide, 0.4" long and 0.5" high.



## • Multilayer laminated card

The circuit modules are interconnected by a 12-layer laminated card. The cards used were  $9 \times 10 \times 0.118$  inches. Each card can hold 240 10-pin modules. The 10 pins of each module are inserted into through-plated holes. Figure 4 shows a card with modules. The capacitors around the card are for power supply bypassing. The strips of through-plated holes around the card connect signal lines and power supplies to the card. Eardley and Berggren<sup>4</sup> have described the characteristics of the card.

#### System design

# • Arithmetic and Logic Unit

The machine organization framework chosen for the model effort was contributed by J. Cocke, of IBM Research, and G. Amdahl and E. M. Boehm of the IBM Data Systems Division.

The Arithmetic and Logic Unit (ALU) will execute operations upon command from a Program Control Unit (PCU), which was not implemented. The operation is specified by a 14-bit instruction word. Each operation requires two 64-bit operands: the *A* operand and the *B* operand.

Figure 2 A tunnel diode pair subassembly.

Figure 3 Twelve tab-mounted transistors welded to a wafer.

Figure 4 A card with modules.

During the first cycle of execution of any operation (called an *answer* cycle), ALU will emit an answer signal to the PCU. From the rise of this answer signal, the PCU has one ALU cycle to drop its request line if it wants no further service. If it desires further service, its request line may remain up, but from the rise of the answer signal, it has a maximum of one ALU cycle to specify the new operation and two ALU cycles to specify operands.

At the end of execution, the result is emitted from ALU in synchronism with ALU's clock and is available on the Out Bus for one ALU cycle.

## • Functional units

ALU executes operations by means of a set of functional units interconnected as shown in Fig. 5. There are four control rings: A, M, D, and S. The A ring controls execution of an ADD or SUBTRACT operation. The M ring serves to execute a MULTIPLY operation. The D ring serves to execute DIVIDE. The S ring serves to execute SHIFT-CONNECT.

An interference is defined as any combination of ring states such that more than one ring trigger excites the same functional unit of ALU. The excitation of the main ALU controls is constrained only by the following rules: (1) no more than one ring may be started during any given cycle, and (2) no interferences are allowed.

To prevent interferences, a feedback signal, derived for each ring input, blocks any start excitation that would lead to an interference. These blocking signals depend upon the total control state. The functions can be derived in a straightforward manner.
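The blocking rule can be illustrated with a minimal sketch. The per-ring schedules below are hypothetical placeholders, since the text does not list which functional unit each ring state excites; only the rule itself (no functional unit may be excited by more than one ring in the same cycle) is taken from the text. The function name `may_start` is likewise an illustrative choice.

```python
from collections import Counter

# HYPOTHETICAL schedules: the functional unit excited at each ring state.
# These are illustrative assumptions, not the schedules of the actual ALU.
RING_SCHEDULE = {
    "A": ["In Bus", "Shifter", "Adder", "Out Bus"],
    "S": ["Shifter", "Out Bus"],
}


def has_interference(states):
    """True if more than one ring excites the same functional unit."""
    units = [RING_SCHEDULE[ring][state] for ring, state in states.items()]
    return any(count > 1 for count in Counter(units).values())


def may_start(ring, active):
    """Blocking function: may `ring` be started in the coming cycle?

    `active` maps each running ring to the state it will occupy in that
    cycle. The start is blocked if any present or future cycle would
    contain an interference.
    """
    states = dict(active)
    states[ring] = 0
    while states:
        if has_interference(states):
            return False
        # Advance every ring by one cycle; drop rings that have finished.
        states = {
            r: s + 1
            for r, s in states.items()
            if s + 1 < len(RING_SCHEDULE[r])
        }
    return True
```

With these assumed schedules, a SHIFT-CONNECT start is blocked while an ADD is in its second or third cycle, and allowed once the ADD has reached its last cycle. Rule (1), that at most one ring starts per cycle, is left to the caller.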

The organization of each functional unit can be briefly described as follows:

In Bus: A simple gate structure to select one of the four PCU's and allow both the A and B operands to be transmitted to the next three functional units.

Connector: This unit performs one of 16 possible logic operations on the 64-bit A and B operands, bit by bit (e.g., $A \cdot B$, $\overline{A} \cdot B$, $A + \overline{B}$, $A + 0$, et cetera).

Shifter: A 64-bit shifter consisting of three banks of 4-way distribution (cascode) circuits, with options of end-around and left-for-right reversal of bits.

Multiply-Divide Unit: Grossly, the structure of this unit is a double bank of Carry-Save adders. Combining MULTIPLY and DIVIDE as one unit permits the efficient use of common equipment. One DIVIDE iteration requires two ALU cycles while two MULTIPLY iterations are performed in one ALU cycle. In both MULTIPLY and DIVIDE, four bits (of multiplier or quotient) are processed per iteration.

Adder: A Carry Select<sup>5</sup> structure is used. With eight transistors for a binary full adder the Carry Select structure is less expensive than Carry Lookahead.

Out Bus: This unit selects the answer and gates it to the PCU.
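The Connector's 16 operations correspond to the 16 Boolean functions of two variables. The sketch below shows one way to express them, assuming a 4-bit function code that is simply the truth table of the operation. That encoding and the name `connect` are assumptions for illustration; the text does not say how the instruction word selects the operation.

```python
MASK64 = (1 << 64) - 1  # operands are 64-bit words


def connect(code, a, b):
    """Apply one of the 16 two-variable Boolean functions bit by bit.

    Assumed encoding: bit k of `code` is the output for the input pair
    (a_bit, b_bit) with k = 2 * a_bit + b_bit.
    """
    if not 0 <= code < 16:
        raise ValueError("function code must be in 0..15")
    a &= MASK64
    b &= MASK64
    not_a = ~a & MASK64
    not_b = ~b & MASK64
    result = 0
    if code & 0b0001:
        result |= not_a & not_b  # positions where a = 0, b = 0
    if code & 0b0010:
        result |= not_a & b      # positions where a = 0, b = 1
    if code & 0b0100:
        result |= a & not_b      # positions where a = 1, b = 0
    if code & 0b1000:
        result |= a & b          # positions where a = 1, b = 1
    return result
```

Under this encoding, `0b1000` gives $A \cdot B$, `0b0010` gives $\overline{A} \cdot B$, `0b1101` gives $A + \overline{B}$, and `0b1100` passes A unchanged ($A + 0$).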

#### Timing

Basically the ALU is a one-clock system. All functional units have registers on outputs. Each register has one control port. With its control negative, data is latched in the register. With its control positive, data at the input is transmitted to the output with one circuit level delay. The control is a 5-nsec pulse. Since this excites all registers simultaneously, the circuit and wiring delays must serve as dynamic storage elements. The minimum length path between registers is 3 circuit levels and the maximum is 17. These paths set the clock pulse width (5 nsec < 3 circuit levels of delay) and cycle time (62.5 nsec > 17 circuit levels of delay).
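The two timing conditions can be checked numerically. The sketch below assumes, purely for illustration, that every circuit level has the same delay; real levels differ, so this only bounds the acceptable range. The pulse width, cycle time, and path lengths are the values quoted in the text.

```python
CLOCK_PULSE_NSEC = 5.0   # control pulse width (from the text)
CYCLE_TIME_NSEC = 62.5   # ALU cycle time (from the text)
MIN_PATH_LEVELS = 3      # shortest register-to-register path (from the text)
MAX_PATH_LEVELS = 17     # longest register-to-register path (from the text)


def timing_ok(delay_per_level_nsec):
    """Check both conditions for an ASSUMED uniform delay per circuit level."""
    # The pulse must end before new data races through the shortest path.
    no_race = CLOCK_PULSE_NSEC < MIN_PATH_LEVELS * delay_per_level_nsec
    # The longest path must settle before the next pulse arrives.
    settles = CYCLE_TIME_NSEC > MAX_PATH_LEVELS * delay_per_level_nsec
    return no_race and settles


def allowed_delay_window_nsec():
    """Return the (lower, upper) bounds on a uniform per-level delay."""
    return (CLOCK_PULSE_NSEC / MIN_PATH_LEVELS,
            CYCLE_TIME_NSEC / MAX_PATH_LEVELS)
```

Under the uniform-delay assumption the window is roughly 1.7 to 3.7 nsec per circuit level; a faster circuit would require a narrower clock pulse, and a slower one a longer cycle.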

# • ALU Performance

The system operating speeds, based on the 62.5 nsec cycle obtained in the model, are shown in Table 1.

Table 1 System speeds.

| Operation         | No. of Cycles | Execution Time (μsec) |
|-------------------|---------------|-----------------------|
| SHIFT-CONNECT (S) | 2             | 0.12                  |
| ADD/SUBTRACT (A)  | 4             | 0.25                  |
| MULTIPLY (M)      | 9             | 0.56                  |
| DIVIDE (D)        | 29            | 1.81                  |
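The execution times in Table 1 follow directly from the cycle counts and the 62.5-nsec cycle. The short sketch below reproduces them with integer arithmetic; the assumption that the tabulated values were truncated (rather than rounded) to two decimal places is an inference from the numbers, not something the text states.

```python
CYCLE_PSEC = 62_500  # 62.5 nsec expressed in picoseconds to stay in integers

CYCLES = {
    "SHIFT-CONNECT": 2,
    "ADD/SUBTRACT": 4,
    "MULTIPLY": 9,
    "DIVIDE": 29,
}


def execution_time_nsec(cycles):
    """Exact execution time in nanoseconds."""
    return cycles * CYCLE_PSEC / 1000


def execution_time_usec_truncated(cycles):
    """Execution time in microseconds, truncated to two decimals (assumed)."""
    hundredths = cycles * CYCLE_PSEC // 10_000  # 0.01 usec = 10,000 psec
    return f"{hundredths // 100}.{hundredths % 100:02d}"


table = {op: execution_time_usec_truncated(n) for op, n in CYCLES.items()}
```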



Figure 5 Arithmetic and Logic Unit. The number of transistors is indicated in each box.

# **Layout of ALU**

After the logic organization of the complete ALU was determined, several packaging proposals were evaluated. The first was a layout of data paths on cards with 120 pins and 200 circuit positions. Since these layouts were severely pin-limited, most of the required 72 cards were only 1/3 populated. An even more important factor was the additional circuits required for driving long lines off the cards. These circuits increased the worst-case paths by 10 to 20 percent.

A second packaging approach attempted to overcome these deficiencies. Cards were laid out in a checkerboard or parquet pattern. Each card could butt against four others and could have up to 200 pins on each edge. Layout was by function; that is, each card usually contained a specific function such as In Bus A, In Bus B, Exponent Arithmetic, et cetera. This layout required only 24 cards but suffered from a serious line congestion problem on many cards. The In Bus, for example, has four transmission lines entering and three leaving for each of 64 bit positions. Functional packaging in this case requires a card with 144 circuit positions and 455 input/output lines.

In an effort to reduce line congestion, the third and final approach was tried. The parquet pattern was retained but logic functions were spread across cards almost irrespective of card boundaries. This is not unreasonable because circuits on adjacent cards are separated by only 2 to 3 inches of printed wiring of controlled characteristic impedance. These lengths are short enough to qualify as point-to-point wiring not requiring line drivers and line terminations.

The card for this final layout has 240 circuit positions and approximately 200 possible I/O (input-output) connections per edge. The card size is $9'' \times 10''$. Edges of cards overlap by $1''$ so that the through-plated holes which serve as I/O points line up on adjacent cards. For the model, pins were soldered into these through-plated holes to complete the I/O connection.

Twenty cards forming a sheet  $37'' \times 42''$  were required for the complete layout. This is shown in Fig. 6. Sections associated with specific functions are marked. Unmarked sections are utilized for general control logic. Ninety percent of available circuit positions are utilized.

### • Pin counts

The over-all assignment of logic to cards is shown in Fig. 6. Note that some card boundaries fall between major logic functions while others are contained within the boundaries of major logic functions. Heaviest wiring density does not seem to correlate obviously with logic boundaries. The wiring density over the 20-card array is relatively uniform. Typically each 10-pin circuit module has 6 lines crossing in both the North-South and East-West directions (i.e., 72 lines on the short edge with 12 circuits, and 126 lines on the long edge with 20 circuits). In the areas of high wiring density there are approximately 10 lines per circuit in either direction.

The ALU layout indicates that an average of 334 pins per card are required for an average density of 344 basic circuits per card. This in no way conflicts with the often quoted 1:1 pin-to-circuit ratio requirement. However, the packaging technique specified packs 1.7 basic circuits in a 10-pin module. Furthermore, even if 1.7 pins per module were available, remember that this is an average requirement. It would result in 11 out of 20 cards being pin limited. If we go to maximum requirements rather than averages, we obtain the figure 491 pins/240 modules, or 2.0 pins/module. Even this, however, is an average of some sort since the four edges have been added together. Examining the edge requirements results in 200 pins per long edge and 120 per short edge which gives the figure 640 pins/card and 240 modules/card, or 2.7 pins/module.

Changing the basic card size will change this ratio. We observe that, per module, we find a maximum requirement of approximately 10 lines running in the direction of the flow of data and slightly more than 10 lines running cross field. Smaller cards will therefore have a higher pin/module requirement. For example, a card half the size of the card used would require 440 pins/120 modules, or 3.7 pins/module. In the limiting case of one large card, we have 840 I-O pins/4281 modules, or 0.2 pins/module.
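The pin-per-module figures quoted above can be reproduced with simple arithmetic. In the sketch below, the pin and module counts are taken from the text; the reading that the half-size card is obtained by halving the long edge (giving 2 × 100 + 2 × 120 = 440 pins) is an inference that happens to match the quoted figure, not something the text states.

```python
def edge_pins(long_edge_pins, short_edge_pins):
    """Total I/O pins on a card with two long and two short edges."""
    return 2 * long_edge_pins + 2 * short_edge_pins


def pins_per_module(pins, modules):
    """Pin-to-module ratio, rounded to one decimal as in the text."""
    return round(pins / modules, 1)


full_card_pins = edge_pins(200, 120)   # 640 pins on a 240-module card
half_card_pins = edge_pins(100, 120)   # ASSUMES the long edge is halved

ratios = {
    "maximum, four edges summed": pins_per_module(491, 240),
    "maximum, edge by edge": pins_per_module(full_card_pins, 240),
    "half-size card": pins_per_module(half_card_pins, 120),
    "one large card": pins_per_module(840, 4281),
}
```

The trend is the point of the exercise: the perimeter shrinks more slowly than the area, so smaller cards need more pins per module.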

### **Statistics**

The basic circuit types are shown below. A basic circuit is defined as having one and only one current source.

Variations of these basic circuits are formed by paralleling transistors. Of the various possibilities, seven were used, shown in the illustration below.

The O1 configuration is used as (1) a buffer into wired ANDS, (2) a means of extending fan-out, (3) with tunnel diodes as a line driver, (4) at a higher current level and with tunnel diodes as a power driver.

The functional combinations of basic circuits are indicated in Table 2.

Many other combinations of basic circuits and functional circuits are possible. The ALU layout has been restricted to using the seven basic circuits and six functional circuits mentioned. When packaging these 13 circuits we have the problem that sometimes resistors are required on a signal pin and sometimes not. This variation requires four additional assemblies differing from previous ones only in the absence of a resistor. These four variations could be eliminated with a 12-pin module. For drivers we add tunnel diodes to O1's and change the emitter resistor for the power driver. These two additions bring the total number of circuit assemblies to 19.

Sixteen percent of the functional circuits in the ALU are drivers. Now (1) if all circuits could drive two or three transmission lines directly, this would drop to 9%; (2) if fan-out from each logic circuit were 6 to 8 instead of 3 to 4, the driver count would drop only another 1.5%; (3) if the power driver (PD) were designed to drive 48 or 64 bases instead of 16, and if loading were chosen to make better use of the two available outputs, then the 316 PD's would be replaced by an estimated 60 PD's. The combined effects of (1), (2), and (3) would be to replace 665 drivers with approximately 80 PD's. This would save at least 1000 transistors.

Table 2 Other circuit combinations used

| Functional Circuits | Code | Composition | Module Size | Number of Signal Pins |
|---------------------|------|-------------|-------------|-----------------------|
| Binary Full Adder | BFA1 | Three feedback current switches | 1 cell | 3 |
| Exclusive Or | XO | Two I2's | 1 cell | 3 |
| Binary Full Adder | BFA2 | Two XO's and two O2's | 2 cells | 5 |
| Gate (effectively two two-input ANDs feeding a two-input OR) | G | One I2 and one O2 | 1 cell | 5 |
| Register | T | One I2 and one O2 plus two tunnel diodes | 1 cell | 3 |
| Shift Cell (can shift 4 bits 4 ways) | SC | Four S4's (four-way spread gates) | 3 cells | 18 |

Only 5% (viz., 191) of the circuits shown are Exclusive Or (XO's). However, another 284 are used in Binary Full Adders (BFA). Also parity checking is not included in this ALU, but is used in operational systems. Therefore, XO's would form a fairly large and (with six transistors per XO) an expensive portion of an operational system.

Gates, Binary Full Adders, and Registers account for 58% of the transistor usage. Saving one or two transistors in any of these circuits would make a large reduction in total transistor count.

#### The model

A full layout of the ALU was completed. This layout included positioning of all modules such that the resulting wiring to interconnect the modules seemed feasible. Consideration was given to shortening the wire length of worst-case paths and minimizing the total number of lines crossing card boundaries. The final layout covered 20 cards with a population density of 89.3% and with a maximum of 170 lines crossing a card edge. The total ALU layout included 4,281 modules and 20,756 transistors.

Figure 6 A paper mock-up of ALU.

From this ALU a nucleus was abstracted to serve as a model. The following characteristics were included:

- 1) Controls were provided to give priority, blocking, and start-up in a manner similar to the full ALU.
- 2) Data paths for the MULTIPLY/DIVIDE unit were included. These are the worst-case paths in the ALU. Wire lengths, spacing, circuit placement, et cetera, mimic the full system.
- 3) Two PCU sources were provided, each supplying a 2-bit exponent and 3-bit fraction for both A and B operands, plus an 8-bit instruction.
- 4) A PCU receiving register was provided which could in turn be used in a feedback mode to supply new operands or instructions.

It is felt that this model provided a good vehicle for study of the behavior which might be expected if the entire ALU were constructed. Sufficient worst-case paths and control paths were provided for a realistic test of the timing relationships.

## Model statistics

The model consists of a 10-card nucleus of 1,838 transistors and 424 circuits as tabulated in Table 3.

Table 3 Circuit and transistor data

|                     | Circuits | Transistors |
|---------------------|----------|-------------|
| MPY/DIV Unit        | 245      | 1225        |
| In Bus (A, B)       | 14       | 42          |
| Instruction Bus     | 6        | 24          |
| Front End Controls  | 56       | 187         |
| Rings & MD Controls | 83       | 270         |
| PCU's               | 20       | 92          |
| Total               | 424      | 1838        |

All lines passing in and out of the above cross-section of the model are biased or terminated, respectively, in order to facilitate proper circuit operation. Figure 7 shows the completed model.

# **Experimental results**

#### • Assembly experience

The technique of resistance-welding one-mil wires proved very satisfactory. Transistor leads were cut to 50-mil lengths, which provided for a second weld if the first was unsuccessful. As a result, only 6 of 2,300 transistors were lost due to welding. No transistors were known to be damaged by overheating or high voltage at any phase of the testing. The only damage to transistors occurred in handling prior to assembly in a module.

Tunnel diodes were frequently lost when the +1 or -3 volt supply was accidentally shorted to a signal terminal. Only one diode pair at a time can be lost this way since the excess voltage will not propagate down a chain of circuits. A voltage interlock was also used and the distribution system impedances were controlled so that any supply could be shorted to any other supply without damaging any circuit component.

## • Typical transients

Figure 8 shows the waveform on the output of a line driver when driving (a) too low an impedance and (b) the proper impedance. When properly loaded, the line driver has rise times of 1.3 nsec. This rise time deteriorates to approximately 2 nsec after passing through three cards, as shown in (c).

Figure 7 The model. 20,000 transistors would occupy 20 laminated cards in a 4×5 array. This model used circuits and power on only 10 cards.

Figure 9 shows (a) the input to an 8-inch, 26-ohm line which terminates in three power drivers and 22 ohms; and (b) the output of one of these power drivers. Originally, unterminated 100-ohm lines were used for "short" lines. These proved to be completely inadequate for the transmission of 5-nsec pulses. They were converted to lower impedance lines by paralleling 70-ohm coaxial cable with the existing 100-ohm printed wiring (70 ohms in parallel with 70 ohms in parallel with 100 ohms = 26 ohms). Power drivers had been designed to drive 22 ohms and this was then used as a line termination.
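The 26-ohm figure is the parallel combination of the three line impedances. The sketch below checks it and, as an added illustration, evaluates the standard transmission-line reflection coefficient $(Z_L - Z_0)/(Z_L + Z_0)$ for a 22-ohm termination. That second calculation is not taken from the text and ignores the loading of the three power-driver inputs, so it should be read only as a rough indication of the residual mismatch.

```python
def parallel(*impedances_ohms):
    """Equivalent impedance of impedances connected in parallel."""
    return 1.0 / sum(1.0 / z for z in impedances_ohms)


def reflection_coefficient(load_ohms, line_ohms):
    """Textbook voltage reflection coefficient at a resistive termination."""
    return (load_ohms - line_ohms) / (load_ohms + line_ohms)


# Two 70-ohm coaxial cables in parallel with the 100-ohm printed line.
line_impedance = parallel(70.0, 70.0, 100.0)

# ASSUMPTION: termination treated as a pure 22-ohm resistance.
gamma = reflection_coefficient(22.0, line_impedance)
```

The result is about 25.9 ohms, which rounds to the 26 ohms quoted; under the stated simplification, the 22-ohm termination reflects less than a tenth of the incident step.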

## • Performance of the Divide worst-case path

This worst-case path consists of 33 circuits, 29 for logic and 4 for power drivers. The number of logic functions (i.e., AND-OR's) in this chain is 56. The total line length was 84", or 15.5 nsec. The path delay breakdown is as follows:

| Component      | Delay (nsec) |
|----------------|--------------|
| Logic circuits | 90           |
| Power drivers  | 20           |
| Line delay     | 15           |
| Total delay    | 125          |

Two figures for worst-case average delay can be used:

Delay per level of raw logic: 90/56 = 1.6 nsec

Delay per level of logic (in a system): 125/56 = 2.2 nsec

Note that line delay adds 15/33 = 0.45 nsec per circuit.
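The figures above can be reproduced in a few lines. All inputs in the sketch below come from the paper; the propagation velocity of 5.4 inches/nsec is the value quoted in the Conclusions, and the variable names are illustrative only.

```python
LOGIC_DELAY_NSEC = 90       # 29 logic circuits
DRIVER_DELAY_NSEC = 20      # 4 power drivers
LINE_DELAY_NSEC = 15        # wiring delay, rounded as in the breakdown
LOGIC_FUNCTIONS = 56        # AND-OR levels in the chain
CIRCUITS_IN_PATH = 33       # 29 logic circuits + 4 power drivers
LINE_LENGTH_INCHES = 84
VELOCITY_IN_PER_NSEC = 5.4  # quoted in the Conclusions

total_delay = LOGIC_DELAY_NSEC + DRIVER_DELAY_NSEC + LINE_DELAY_NSEC

# Delay per level counting only the logic circuits themselves.
raw_delay_per_level = LOGIC_DELAY_NSEC / LOGIC_FUNCTIONS

# Delay per level including power drivers and wiring.
system_delay_per_level = total_delay / LOGIC_FUNCTIONS

# Wiring delay attributed to each circuit in the path.
line_delay_per_circuit = LINE_DELAY_NSEC / CIRCUITS_IN_PATH

# Cross-check of the line delay from length and velocity (about 15.5 nsec).
line_delay_from_length = LINE_LENGTH_INCHES / VELOCITY_IN_PER_NSEC
```

Power drivers and wiring together add 35 nsec to the 90 nsec of raw logic delay, which is why the system figure is about 40% higher than the raw one.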

#### **Conclusions**

Figure 8 Waveforms at a line driver driving (a) three 70-ohm lines and a 7-inch, 100-ohm unterminated stub, and (b) three terminated 70-ohm lines. Waveform (c) is at a point 23" away from a line driver after passing through three card-to-card connections.

Figure 9 A pulse (a) into a 26-ohm line which terminates in three power drivers and 22 ohms; and (b) the output of one of the power drivers loaded with 16 bases.

The primary purpose of the model program was to evaluate the existing circuit technology and provide direction to circuit and component design efforts. Obviously, valid conclusions can be drawn only if the model under consideration includes problems that are typical of high-speed parallel computers. Although there are many factors of machine organization which were not included in the model, such as serial shifters, multiphase clocking schemes, synchronization with either local or bulk memory, et cetera, the bulk of circuit problems which can arise from a highly parallel structure have been tested. Experience with the model leads us to the following conclusions:

- 1) All lines must be considered transmission lines. In the model layouts, lines shorter than 10 inches were considered short enough to disregard transmission line effects. In practice, lines as short as 2 inches showed distinct reflections. This was due to rise times that were faster than expected, combined with a line propagation velocity of 5.4 inches/nsec. The most severe reflection problem occurred with power drivers driving many branched nets.
- 2) The general technique for distributing low voltages is adequate. Minor changes in the present model would limit the distribution dc drop to 2 mv and ac noise to 4 mv.
- 3) The parquet panel approach provides adequate communication channels, and accessibility of all signal points is a major convenience. Line lengths proved to be longer than originally anticipated but were significantly shorter than those in the other layout schemes considered.
- 4) The IMPLICATION function obtained with the cascode circuit is very useful. The Register, Exclusive Or, and Spread Gate are important examples. More experience will probably result in increased use of this function.
- 5) The delay figure applied to a circuit family is usually the delay of a logic circuit divided by the number of logic functions performed by this circuit. This definition yields a figure of 1.6 nsec per level of logic for the circuits used in this model. Experience with the model leads to a more useful definition of delay, i.e., the delay per level of logic including power driver and line delays in a true system environment. With this definition, a delay of 2.2 nsec per level of logic was obtained in worst-case paths.

## References

- 1. *Digest of International Solid State Circuits Conference*, 1960 through 1963.
- 2. D. H. Chung and J. A. Palmieri, "Design of ACP Resistor-Coupled Circuits," *IBM Journal*, this issue, p. 190.
- 3. D. W. Murphy and J. R. Turnbull, "Design of ACP Tunnel-Diode-Coupled Circuits," *IBM Journal*, to be published.
- 4. D. B. Eardley and J. L. Berggren, "A Multilayer Printed Circuit Card for Nanosecond Circuits," *IBM Journal*, to be published.
- 5. O. J. Bedrij, "Carry-Select Adder," *IRE Trans. on Electronic Computers*, EC-11, 340 (1962).

Received March 5, 1963