NC State University

ECE 741: Sequential Machines

Homework #7 - Spring 2003

Assignment

You have been working on a design that has grown very large, and the tool execution times have become excessively long. You would like to split the design into two parts and run the tool-flow on each part independently. In this assignment, you will create a hierarchical floorplan for a design containing a microprocessor (with memories) and a UART. Then you will place, synthesize the clock-tree, route, extract, and verify the timing of the design.

Tutorial

  1. Creating the floorplan
    For this assignment, our floorplan will have the following constraints:
    • The top-level aspect ratio should be 1:1 (square) to simplify the packaging and manufacturing.
    • All blocks should be rectangular to simplify the floorplanning process.
    • The off-chip pins should be distributed as evenly along the outer perimeter as possible, again, to simplify the packaging.
    • In order to simplify our hierarchical timing verification flow later on, we will assume that a buffer cell is placed in-between the or1200 and uart16550 designs as shown below. This will waste some area, but the wire-lengths will be so short that the buffer load and drive capability can be accurately and easily modeled in PrimeTime. Therefore, we must constrain our floorplan to make sure that on-chip pins are as close together as possible.
      [Figure: Buffer Design]
    • Lastly, since there are four clocks in this design (three for the or1200 and one for the uart16550 design), all of which come in through boundary pins, all pins must be as close together as possible to minimize skew.
    Our floorplan will consist of the dimensions for each block and the boundary-pin assignments. FE will place the blocks for us, and we will ignore partitions and power-routes for this assignment. We will assume that the UART is connected to the OR1200 processor as a slave on the data-wishbone bus and that the pins will be arranged as shown below. Since we don't care too much about the off-chip pins in this assignment, we'll allow them to be randomly distributed around the edges. Since there are only 9 pins left on the UART, however, the right side of the chip will be rather sparse. In practice, you would probably not want to take up an entire side of a chip with a block as simple as this UART, but we'll ignore this for now.
    [Figure: Chip Floorplan]
  2. Floorplanning the or1200 design with memories
    • Create a run directory for this assignment. Since we're working with the or1200 design again, you may leave it in the same place or copy it locally if you like.
    • Assuming that you left the ARTISAN strings defined in the or1200_defines.v file, your final netlist from Homework #6 should be a good starting point for this homework. To make sure, look in your verilog netlist for cell-names that start with "art_" (these are the memories).
    • Create an FE run directory with dirSetup.py run/fe. Change to this directory, type add cadence, and then type encounter.
    • Choose Design -> Import. We will import the design as before, except that we need to add an entry to the Block Cell Libraries and LEF Files sections for the memories. Typing these long paths into this window can be very time consuming, so I find it to be much easier to modify the .conf file in a text editor. For your convenience, you may use the or1200_top.conf file provided here. Copy this file into your directory and examine it. Note that the ui_blklib line refers to a directory that contains .cdump files for all of the memories. You shouldn't need to modify this line for your own project. Next, note that the ui_leffile line refers to the ncsulib25.lef file, as before, but that it also refers to a number of other LEF files. You will need to add similar entries to this line for your project if it contains memories. Click the Load... button to load the design.
    • Once the design has been loaded, you should see the memories appear to the right of the floorplan area. Now that FE has estimated the total area for the design with the memories, we can use this estimate to make our floorplan. Zoom in on the upper right-hand corner and write down the coordinates. On my design, the coordinates are approximately (2942 um, 2942 um).
    • Next, we need to make a pin-constraint file. It's difficult to make this from scratch, so we'll let FE create an initial pin-constraint file. In order to do that, we have to place the design first. Place the design with Place -> Place.... The memories should be placed along with all of the standard cells.
    • Now that the design has been placed, we can save the pin constraint file with Floorplan -> Save Floorplan -> IO File.... Make sure the sequence box is checked, and save this file as or1200_top.io.
    • Examine the or1200_top.io file. This file contains entries of the form "Pin: port_name side metal_layer" where the side can be N, S, E or W and metal_layer can be 1-5. The order of pins on a side is governed by the order in which they appear in this file. As shown in the figure below, the first pins to appear in the file for a given side will have the lowest X or Y coordinate, while the last will have the largest.
      [Figure: Pin Order]
    • Next we need to modify this file to match the floorplan listed above. We won't bother specifying the metal-layer, so we'll strip that information out and let FE choose for us. For your convenience, you may use the or1200_top.io file provided here.
    • Exit FE. Before we can finish floorplanning this block, we need to know its exact dimensions. We won't know that until we know how large the UART is, so let's move on to that block and come back to this one later.
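The metal-layer stripping described above for or1200_top.io can be sketched in a few lines of Python. This is only an illustration, assuming every pin entry has exactly the four whitespace-separated fields "Pin: port_name side metal_layer" described in the text; the sides and layer numbers in the sample are made up, and the function name strip_metal_layer is hypothetical.

```python
def strip_metal_layer(io_text):
    """Drop the trailing metal-layer field from each 'Pin:' entry."""
    out = []
    for line in io_text.splitlines():
        fields = line.split()
        # Assumed entry shape: "Pin: port_name side metal_layer"
        if len(fields) == 4 and fields[0] == "Pin:":
            out.append(" ".join(fields[:3]))
        else:
            out.append(line)  # leave any other line untouched
    return "\n".join(out)


# Made-up sample entries; only the clock port names come from the assignment.
sample = "Pin: clk_i W 3\nPin: dwb_clk_i W 2\nPin: iwb_clk_i W 4"
stripped = strip_metal_layer(sample)
print(stripped)
```

Because the order of the lines is preserved, the pin order along each side is unchanged; only the layer choice is handed back to FE.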
  3. Synthesizing and Floorplanning the UART
    • Copy the file uart16550.tar.gz locally. Unzip and untar it. Examine the file uart16550/syn/bin/top.scr. Notice that the clock uncertainty has been set to 500 ps, and that the design should attempt to fix hold-time violations, just as with the or1200 design.
    • Change to the uart16550/syn/run directory. Type add synopsys and synthesize the design with the command ../bin/run_syn. The flat netlist should appear as ../out/final_uart_top.v in about 5 minutes.
    • Return to the run/fe directory and start encounter. Repeat the steps that you took for the or1200 design. When I do this, I get the upper right-hand coordinates (713 um, 713 um) and the modified uart_top.io file given here.
    • Now that we know the areas of the blocks, we can choose the dimensions. The total area is approximately the sum of the UART and OR1200 areas:
      A = 2942^2 + 713^2
      The chip side-length, which is also the height of the UART, is the square root of the total area, and the width of the UART follows from the ratio of its area to the total area:
      Huart = sqrt(A) = 3027 um
      Wuart = sqrt(A) * (713^2 / A) = 168 um
      To be precise, we need to restrict these numbers to the placement grid, using our site from our technology file, which is 1.08 um wide by 12.96 um high. Rounding the previous numbers to this grid, we get the following:
      Huart = 3019.68 um
      Wuart = 167.4 um
      Now we need to finish the floorplan for the UART.
    • Choose Floorplan -> Clear Floorplan. Select All Floorplan Objects and click OK. This will reset the floorplan.
    • Choose Floorplan -> Load Floorplan -> IO File... to load the uart_top.io file.
    • Choose Floorplan -> Specify Floorplan. Select the Size by, Core Size by, and Width and Height buttons and set these values to the ones determined above. Click OK. The floorplan should be updated, and you should notice the pins are now in the positions you specified.
    • Place the design with Place -> Place.... Save the design as uart_top_placed.
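The UART dimension arithmetic in this section can be checked with a short Python sketch. It assumes the two corner coordinates quoted above (2942 um and 713 um) and assumes that snapping to the placement grid means rounding down to a whole number of sites, which is what reproduces the UART numbers given in the text. The helper name snap_down is hypothetical.

```python
import math

SITE_W = 1.08    # placement-site width in um (from the text)
SITE_H = 12.96   # placement-site height in um (from the text)


def snap_down(value, pitch):
    """Round value down to a whole multiple of pitch (assumed rounding rule)."""
    return math.floor(value / pitch) * pitch


or1200_side = 2942.0   # side of the FE area estimate for the or1200, in um
uart_side = 713.0      # side of the FE area estimate for the UART, in um

total_area = or1200_side**2 + uart_side**2
chip_side = math.sqrt(total_area)                   # about 3027 um
uart_width = chip_side * uart_side**2 / total_area  # about 168 um

h_uart = snap_down(chip_side, SITE_H)
w_uart = snap_down(uart_width, SITE_W)
print(f"H_uart = {h_uart:.2f} um, W_uart = {w_uart:.2f} um")
```

Substituting your own corner coordinates gives the values to enter in Floorplan -> Specify Floorplan for your project.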
  4. Synthesizing the Clock Trees
    Next, we want to insert clock-trees that are as closely matched as possible in terms of insertion delay so that the skew between the blocks is minimized. To do that, we will first synthesize the clock trees for all designs with unconstrained insertion delay and then find out which one has the largest insertion delay. We will then try to match all other blocks to that block's insertion delay.
    • Use the following uart_top.ctstch file to synthesize the clock-tree for the UART. Note that this file is very similar to the one you ended up with in Homework #6, except that the root pin has been changed.
    • Exit and restart FE. Import the or1200 design again. Use Floorplan -> Load Floorplan -> IO File... to load the or1200_top.io file and Floorplan -> Specify Floorplan to set the width and height. An analysis similar to the one above will give us the following dimensions for this design:
      Hor1200 = 3019.68 um
      Wor1200 = 2859.84 um
      Enter these dimensions for the block and click OK to complete the floorplan.
    • Place the design with Place -> Place.... Save the design as or1200_top_placed.
    • Use the following or1200_top.ctstch file to synthesize the clock-tree for the or1200. Again, note that this file is very similar to the one you ended up with in Homework #6, except that all three clocks are now synthesized.
    • Next, examine the clock-tree synthesis reports (the .ctsrpt files) for both blocks. You should find information similar to the following:

      Block      Clock      Sinks  Levels  Ins. Delay (ps)  Skew (ps)
      uart16550  wb_clk_i     606       7              766         48
      or1200     dwb_clk_i     73       7              714         45
      or1200     iwb_clk_i     73       5              520         27
      or1200     clk_i       1151       9             1241        125

      Note that the clk_i tree has the most insertion delay and the most levels. Note also that the "Nr. of Level" line of this file sometimes gives an incorrect value. I scroll down in the file and count the number of levels to get the actual value.
    • Modify the or1200_top.ctstch and uart_top.ctstch files so that MaxDelay is 1300ps and MinDelay is 1200ps for all 4 clocks.
    • Exit FE and delete the or1200_top_cts and uart_top_cts directories. Restart FE and re-synthesize the clock-trees for both designs. When you are done, examine the log-files and you should see new results close to the following:

      Block      Clock      Sinks  Levels  Ins. Delay (ps)  Skew (ps)
      uart16550  wb_clk_i     605       5             1205         48
      or1200     dwb_clk_i     73       5             1211         50
      or1200     iwb_clk_i     73       5             1239         15
      or1200     clk_i       1151       7             1286        112

      The longest insertion delay is 1286+(112/2) = 1342 ps and the shortest is 1205-(48/2) = 1181 ps, which gives a skew of 161 ps. Not too bad! Now route both designs, extract the parasitics to SPEF files, and write the routes out as DEF files. Exit FE.
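The cross-block skew estimate above can be reproduced with a small Python sketch. It follows the same implicit assumption as the text: each reported insertion delay is treated as the midpoint of that tree's arrival times, so a tree's latest and earliest sinks sit half its skew above and below it. The numbers are the ones from the second table.

```python
# (block, clock, insertion delay in ps, skew in ps) from the re-synthesized trees
trees = [
    ("uart16550", "wb_clk_i", 1205, 48),
    ("or1200", "dwb_clk_i", 1211, 50),
    ("or1200", "iwb_clk_i", 1239, 15),
    ("or1200", "clk_i", 1286, 112),
]

# Latest and earliest possible sink arrivals across all four trees.
latest = max(delay + skew / 2 for _, _, delay, skew in trees)
earliest = min(delay - skew / 2 for _, _, delay, skew in trees)
global_skew = latest - earliest

print(f"latest = {latest:.0f} ps, earliest = {earliest:.0f} ps, skew = {global_skew:.0f} ps")
```

Running the same calculation on your own .ctsrpt numbers shows quickly whether the MinDelay/MaxDelay window needs to be tightened.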
  5. Verifying the Timing
    We will now verify the timing hierarchically with three steps:
    1. A preliminary timing check to determine the maximum transition time and insertion delay of the clock.
    2. An interior timing check to verify that the timing inside the block is correct. This step will use the set_propagated_clock command so that PrimeTime will include the insertion-delay of the clock-tree in its setup- and hold-time checks. This step will use the maximum transition time from the last step as clock-uncertainty (to verify that our safety margin is met). In addition, the insertion-delay from the last step will be used as the input-delay for off-chip inputs (otherwise, we would have hold-time problems). This check will also record the delays and transition-times of on-chip outputs for the last step.
    3. An interblock timing check to verify that the timing between blocks is correct. This step will check only the on-chip inputs, using the on-chip output timing from the last step as the input-delay.
    Do this with the following steps:
    • From your root directory, type the command dirSetup.py run/oa run/pt to create the necessary directories. Go to the run/oa directory and import the two DEF files with the following commands:
      def2oa -def ../fe/or1200_top.def -lib mylib -cell or1200_top -view autoLayout -tech TSMC025_deep
      def2oa -def ../fe/uart_top.def -lib mylib -cell uart_top -view autoLayout
      The -tech option is not needed once the library has been created by the first command. You will get a number of errors, because the memories are not defined in the OpenAccess database, but you may ignore these errors.
    • Copy the file gensinklist.py into this directory and run it. You will notice that it is slightly different from the one used in Homework #6. First of all, it strips the leading back-slash characters. Secondly, it creates a separate .tcl file for each clock (4 total) using a new procedure called writeSinkFile.
    • Change to the run/pt directory. Copy the file timing_prelim.tcl locally. Examine the file. It contains a procedure called clockskew that behaves similarly to the one in Homework #6, but it no longer creates a file with all insertion delays. It loads each Verilog and SPEF file and then the lists of sink- and driver-nodes that you just created. The results are written to a file called prelim.rpt.
    • Run this step with the command pt_shell -f timing_prelim.tcl. You will get a number of errors, because no timing information is defined for the memories, but you may ignore these. Examine the prelim.rpt file and record the largest values for insertion-delay and transition-time. When I run this step, I get a max insertion-delay of 2654 ps and transition-time of 552 ps. For your reference, here is my prelim.rpt file.
    • Copy the file stripwarnings.py locally. This is basically the same file you used in Homework #6, only this time, our scripts will execute it automatically.
    • Copy the file io.tcl locally and examine it. This script defines the on-chip inputs and outputs for each block in four lists: or1200_onchip_i, or1200_onchip_o, uart_onchip_i, and uart_onchip_o. This file will be used by other scripts to set up the timing checks.
    • Copy the file timing_interior.tcl locally. Edit the file. This script contains the following steps:
      • The first part of the script is a procedure called onChipOutputTiming that takes a list of ports and writes a TCL file to define a list of timing values, including the max and min rise-delay, fall-delay and the rise and fall transition times for every port.
      • Next, the io.tcl script is loaded and the CLK_SAFETY_MARGIN and OFFCHIP_INPUT_DELAY variables are defined.
      • Next, the Verilog and SPEF files for the or1200 design are loaded and the clocks and clock uncertainty are defined. The set_propagated_clock tells PrimeTime to include the insertion-delay of the clock in the setup- and hold- time checks.
      • Next, the off-chip inputs are defined as all inputs minus the on-chip inputs, clocks, and reset lines. The input-delay for these inputs is set as the insertion-delay of the clock, and the driving cell is set to be a buffer of drive-strength 2 (which models a pad driver). All other inputs are set as false paths, since we will worry about them later.
      • Next, the max and min paths are reported to a file called temp.rpt. Then the stripwarnings.py script gets rid of the extra warnings and writes the result to or1200_timing.rpt.
      • Next, the onChipOutputTiming procedure is called using the list of on-chip outputs defined in the io.tcl file. This will create a file called or1200_onchip_otime.tcl, which defines a list called or1200_onchip_otime for use in the final step.
      • After that, the same steps are performed on the UART.
      Set the CLK_SAFETY_MARGIN to be the max transition-time (552ps in my case) and the OFFCHIP_INPUT_DELAY to be the max insertion-delay from prelim.rpt plus 200ps, since we assumed an input delay of 200ps during synthesis (2654ps + 200ps = 2854ps in my case). Then run the script with the command pt_shell -f timing_interior.tcl.
    • Examine the or1200_timing.rpt file. When I run this script the longest timing path shows the slack is violated by 14.8 ns, meaning that my minimum cycle-time would be 54.8 ns, instead of 40ns. The hold-time check shows that the slack has been violated by 98 ps. We'll have to either re-synthesize the or1200 design with more clock-uncertainty or re-synthesize the clock-tree with a smaller transition-time to fix this hold violation. For your reference, here is my or1200_timing.rpt file.
    • Examine the uart_timing.rpt file. When I run this script the setup-time check is met with positive slack, but the hold-time check shows that the slack has been violated by 26 ps. Again, we'll have to either re-synthesize the UART design with more clock-uncertainty or re-synthesize the clock-tree with a smaller transition-time to fix this hold violation. For your reference, here is my uart_timing.rpt file.
    • Copy the file timing_interblock.tcl locally. Edit the file. This script contains the following steps:
      • The first part of the script is a procedure called setInputTiming that searches for an output port name in the list of output timing created by the last script. Then it sets the delay and driving characteristics of the corresponding input port based on what it finds.
      • Next, the same io.tcl file is loaded, and the CLK_SAFETY_MARGIN is set, just as in the previous script.
      • Next, the Verilog and SPEF files for the or1200 design are read, and the clocks are created, just as in the previous script.
      • Next, the uart_onchip_otime.tcl script is loaded to define the uart_onchip_otime list of timing values. The setInputTiming procedure is called on every or1200 input, giving the corresponding UART output. Thus, we magically implement a hierarchical timing check.
      • Next, the max and min paths for these inputs are checked, warnings are stripped, and the result is written to the file or1200_interblock.rpt.
      • After that, the same steps are performed on the UART.
      Set the CLK_SAFETY_MARGIN to be the max transition-time (552ps in my case) as you did with the last script. Then run the script with the command pt_shell -f timing_interblock.tcl.
    • Examine the or1200_interblock.rpt file. When I run this script, the setup- and hold-time checks show no violated timing paths! Nothing to fix. For your reference, here is my or1200_interblock.rpt file. Note that there are no reports for the iwb_clk_i clock-group because there are no paths from the on-chip inputs to flip-flops on that clock.
    • Examine the uart_interblock.rpt file. When I run this script the setup-time check is met with positive slack, but the hold-time check shows that the slack has been violated by 4 ps. For your reference, here is my uart_interblock.rpt file. Normally, we would have to resynthesize to eliminate the hold-violation, but there is something that we've overlooked up to now: the delay of the buffer between the blocks. This buffer will add some delay, which will add to our minimum cycle-time and help us with inter-block hold-time violations. PrimeTime includes the "load-dependent" portion of the delay for this buffer, because we used the set_driving_cell command. However, there is still a "load-independent" portion that has been ignored. This delay is dependent on the rise-time at the input. The table below summarises this delay for various transition times.

      Transition Time  Min Delay  Max Delay
      0 - 250 ps       73 ps      111 ps
      250 - 500 ps     106 ps     140 ps
      500 - 750 ps     117 ps     173 ps
      750 - 1000 ps    121 ps     207 ps

      We can examine my or1200_onchip_otime.tcl file and see that all the transition times lie between 0ps and 250ps, which means that there must be a minimum buffer delay of 73ps. Therefore, the slack has not actually been violated. (Here's my uart_onchip_otime.tcl file, in case you want to look at that, too).
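The buffer-delay table above can be turned into a small lookup, sketched here in Python. The table does not say which bin a boundary value such as 250 ps belongs to, so this sketch assumes the upper bound of each bin is inclusive. The function name buffer_delay and the 200 ps example transition time are hypothetical.

```python
# (upper bound of transition-time bin in ps, min delay in ps, max delay in ps)
BUFFER_DELAY_TABLE = [
    (250, 73, 111),
    (500, 106, 140),
    (750, 117, 173),
    (1000, 121, 207),
]


def buffer_delay(transition_ps):
    """Return (min_delay, max_delay) in ps for an input transition time in ps."""
    if transition_ps < 0:
        raise ValueError("transition time must be non-negative")
    for upper, dmin, dmax in BUFFER_DELAY_TABLE:
        if transition_ps <= upper:  # assumption: upper bound is inclusive
            return dmin, dmax
    raise ValueError("transition time is beyond the characterized range")


# Hypothetical transition time inside the 0 - 250 ps bin.
min_delay, max_delay = buffer_delay(200)
print(f"min = {min_delay} ps, max = {max_delay} ps")
```

To find the minimum buffer delay for a block, apply the lookup to the largest transition time in the driving block's _onchip_otime.tcl file.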
    • We can summarize the timing of this system with the following two tables. First, a setup-time table shows the target clock-period and the slacks reported in the or1200_timing.rpt, uart_timing.rpt, or1200_interblock.rpt, and uart_interblock.rpt files. To find the minimum cycle-time for each block, we subtract the minimum of the interior and inter-block slack from the target clock-period. You will need to create a similar table for your project. You do not need to fix the negative slack, because the chip will still work. It's easy to tell from this table that the overall cycle-time will be 54.8 ns.

      Block      Clock      Period  Interior Slack  Inter-block Slack  Min. Cycle Time
      uart16550  wb_clk_i   40 ns   32.0 ns         37.4 ns            8 ns
      or1200     dwb_clk_i  40 ns   7.4 ns          28.2 ns            32.6 ns
      or1200     iwb_clk_i  40 ns   3.8 ns          n/a                36.2 ns
      or1200     clk_i      40 ns   -14.8 ns        15.4 ns            54.8 ns

      Likewise, we can create a table to summarize the hold time slack from all four report files as shown below. Examination of the two _onchip_otime.tcl files and the buffer delay-table above will tell us what the minimum buffer delay is. You will need to create a similar table for your project. We can easily tell from this table that there are no inter-block hold time violations that we need to fix. However, the interior slack violations do need to be fixed in order to guarantee operability with a sufficient margin of safety.

      Block      Clock      Interior Slack  Inter-block Slack  Min. Buffer Delay
      uart16550  wb_clk_i   -26 ps          -4 ps              73 ps
      or1200     dwb_clk_i  77 ps           706 ps             73 ps
      or1200     iwb_clk_i  73 ps           n/a                n/a
      or1200     clk_i      -98 ps          1592 ps            73 ps
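Both summary tables can be cross-checked with a short Python sketch. It uses the slacks listed above, treats "n/a" entries as None, and assumes (as the text argues for the UART) that the minimum buffer delay is simply added to the inter-block hold slack. The function name min_cycle_time is hypothetical.

```python
PERIOD_NS = 40.0

# clock: (interior setup slack, inter-block setup slack) in ns; None means n/a
setup = {
    "wb_clk_i": (32.0, 37.4),
    "dwb_clk_i": (7.4, 28.2),
    "iwb_clk_i": (3.8, None),
    "clk_i": (-14.8, 15.4),
}


def min_cycle_time(period, slacks):
    """Target period minus the worst (smallest) available slack."""
    worst = min(s for s in slacks if s is not None)
    return round(period - worst, 1)


cycle = {clk: min_cycle_time(PERIOD_NS, s) for clk, s in setup.items()}
overall_cycle = max(cycle.values())

# clock: (interior hold slack, inter-block hold slack, min buffer delay) in ps
hold = {
    "wb_clk_i": (-26, -4, 73),
    "dwb_clk_i": (77, 706, 73),
    "iwb_clk_i": (73, None, None),
    "clk_i": (-98, 1592, 73),
}

# Interior violations are taken as reported; inter-block slack gets the buffer delay.
interior_violations = [clk for clk, (i, _, _) in hold.items() if i < 0]
interblock_violations = [
    clk for clk, (_, x, buf) in hold.items() if x is not None and x + buf < 0
]

print(cycle, overall_cycle)
print(interior_violations, interblock_violations)
```

With the numbers above, the interior hold violations on wb_clk_i and clk_i are the only ones left to fix, which matches the conclusion drawn from the tables.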
  6. Complete the Assignment by fixing the hold-time violations.

Submission

You should turn in a .tar.gz archive containing the following files:
  1. A hw7.html file that contains setup- and hold-time summary tables for your final design, similar to the ones above. There must be no hold-time violations.
  2. Your final prelim.rpt file.
  3. Your final timing_interior.tcl file. The CLK_SAFETY_MARGIN must match the value in your prelim.rpt file. Likewise, the OFFCHIP_INPUT_DELAY must be 200 ps greater than the value in your prelim.rpt file.
  4. Your final or1200_timing.rpt and uart_timing.rpt files. The slacks in these files must match the slacks in your summary tables.
  5. Your final or1200_interblock.rpt and uart_interblock.rpt files. The slacks in these files must also match the slacks in your summary tables.
  6. Your final or1200_onchip_otime.tcl and uart_onchip_otime.tcl files. The transition times in these files must be consistent with the minimum buffer delays in your summary table.