Experiments with FPGA Logic Synthesis - Project ElasticC

Introduction

An interest in FPGA synthesis began with a university group project entitled "Realtime Image Processing using an FPGA". For this we were supposed to use a commercial HLS tool, Catapult C. However, as soon as I started using it, I felt it was clunky and sub-optimal for simple streaming single-cycle applications like our project. As a result I decided I would have a go at building a simple tool to synthesise a C-like language to RTL, automatically inserting pipeline registers as required to pass timing for a given operating frequency (the idea of building this was not taken particularly seriously by other members of the group at first). In its simplest form it would be designed for applications where all loops can be unrolled at compile time - this was not actually a major limitation for the purposes of our project. This tool sped up the project itself significantly, as it allowed ideas to be prototyped quickly and easily.

After developing an initial prototype for the single-cycle case, I continued to add features to it and to experiment further. This included functionality to convert designs where loops couldn't be unrolled into state machines. I then experimented with synthesising simple designs down to the level of device primitives - LUTs, single flip-flops, and fast carry logic/multipliers if supported by the target device.

The new, neater codebase - which is not yet as capable as what is described here - is being actively developed and is published as open source (MIT License) on GitHub.


A C-like Language for basic HLS

Language Overview

The initial language was designed to closely follow C syntax, with some cues also taken from C++, but modified and extended to better support the use cases we were interested in. Unlike other, more advanced HLS tools, the priority was ease of use and tidiness rather than strict compatibility with existing C code.

A typical block - which synthesises to a VHDL entity - is this example of converting a pixel stream to greyscale then performing horizontal and vertical Sobel edge detection on it. This would be converted to a pipelined VHDL entity which processes one new pixel each clock cycle.

block convolute_greyscale(clock<rising,56e6>, enable, unsigned<6>[3] input) => (register unsigned<6>[3] output) {
  int i = 0, j = 0, k = 0;

  stream2d<unsigned<6>,5,5,1024> greyscale;
  //Convert pixel to greyscale, and push it into the stream
  greyscale << ((input[0] + input[1] + input[2]) / 3);

  //5x5 sobel matrices
  signed<5>[5][5] m1 = {{1, 2, 0, -2, -1},
                        {4, 8, 0, -8, -4},
                        {6, 12, 0, -12, -6},
                        {4, 8, 0, -8, -4},
                        {1, 2, 0, -2, -1}};

  signed<5>[5][5] m2 = {{-1, -4, -6, -4, -1},
                        {-2, -8, -12, -8, -2},
                        {0, 0, 0, 0, 0},
                        {2, 8, 12, 8, 2},
                        {1, 4, 6, 4, 1}};
  //convolute with horizontal and vertical sobel filters
  signed<16> sum1 = 0, sum2 = 0;
  for(j = 0; j < 5; j++) {
    for(k = 0; k < 5; k++) {
      sum1 += greyscale[j][k] * m1[j][k];
      sum2 += greyscale[j][k] * m2[j][k];
    }
  }
  //take the absolute value of both sums
  if(sum1 < 0) {
    sum1 = 0-sum1;
  }
  if(sum2 < 0) {
    sum2 = 0-sum2;
  }
  //add the absolute values of the two convolution results,
  //and normalise by dividing by 32
  unsigned<9> sum = (sum1 + sum2) >> 5;
  unsigned<6> result;
  //clamp the output value to 63
  if(sum > 63) {
    result = 63;
  } else {
    result = sum;
  }

  for(i = 0; i < 3; i++) {
    output[i] = result;
  }
}

The first thing to note is the block header. This specifies that an entity entitled convolute_greyscale should be created, with a 56MHz clock input and rising-edge-triggered registers. It also specifies that a special enable input should be created for registers and streams (these are described in detail later). There is then the pixel data input itself, specified as an array of 3 6-bit unsigned integers and entitled input. Any other data inputs would be added to this first argument list. The second argument list, following the =>, is the list of outputs. In this case there is only one output, entitled output and also an array of 3 6-bit unsigned integers. The register keyword specifies that an output register should be added. In order to better integrate with many existing designs, arrays in the I/O lists are packed into a single std_logic_vector (in this case 18 bits wide), with the most significant bits corresponding to the highest array index.
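As a rough illustration of this packing convention, the C++ sketch below packs an unsigned<6>[3] value into an 18-bit word in the same bit order (this is my own illustration, not code from the tool):

#include <array>
#include <cstdint>

// Illustrative only: pack an unsigned<6>[3] port into the 18-bit
// std_logic_vector equivalent, with the highest array index in the MSBs.
uint32_t pack_port(const std::array<uint8_t, 3> &value) {
  uint32_t packed = 0;
  for (int i = 2; i >= 0; --i) {
    packed = (packed << 6) | (value[i] & 0x3Fu);  // value[2] ends up in bits 17..12
  }
  return packed;
}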

The next important feature to note is the stream2d data type, which was added in order to simplify many image processing tasks. It creates a 2D moving window of a given size, with a given number of pixels between lines. In this example, greyscale image data is shifted into the stream2d window every clock cycle when enable is high (the block is externally wired such that enable is low during blanking periods). The elements of the window can be accessed like a normal 2D array.
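Conceptually, stream2d<unsigned<6>,5,5,1024> behaves like a line-buffered sliding window over the pixel stream. The C++ sketch below is my own model of those semantics (the indexing convention is an assumption on my part), not the tool's implementation:

#include <cstdint>
#include <vector>

// Model of a 5x5 window over a stream with 1024 pixels between lines.
class Stream2D {
public:
  Stream2D() : buf(4 * 1024 + 5, 0) {}       // last 4 full lines plus 5 pixels
  void push(uint8_t pixel) {                 // corresponds to 'greyscale << value'
    buf.erase(buf.begin());                  // discard the oldest pixel
    buf.push_back(pixel);                    // append the newest
  }
  uint8_t window(int row, int col) const {   // corresponds to 'greyscale[row][col]'
    return buf[row * 1024 + col];            // rows are one line (1024 pixels) apart
  }
private:
  std::vector<uint8_t> buf;
};

In hardware a window like this would typically be built from shift registers and line buffers rather than a software-style queue; the sketch only illustrates which pixels the window indices refer to.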

Other than that, the code closely follows C; the main differences being the arbitrary-width signed and unsigned types, and a few workarounds because I didn't have time to write a fully compliant parser (such as a lack of support for unary minus, with 0- used instead). The other important feature of note is that integer operations always produce a result with a width sufficiently large that it is guaranteed not to overflow (for example, adding two unsigned<6> values produces an unsigned<7> result), minimising the amount of casting needed.

Some additional features were added later on in order to further neaten simple image processing tasks. The first implemented was functions, with operator overloading and an auto type enabling simple generic programming. With the addition of #include support, a simple library of common image processing functions could be created. struct support was also added, with structure types in input/output lists being packed into std_logic_vector ports in a similar way to arrays. Thus a pixel, for example, could be represented using the RGB666 type, which has 6 bits per channel with R in the most significant bits. This would become an 18-bit wide port.

static variables were also added. Declaring a variable as static means it is preserved between clock cycles, useful for tasks such as finding average or min/max values. A reset input can also be added which resets static variables to the value in their initialiser.

Support for ROMs was also added, primarily to enable rendering fonts or textures. ROMs could either be declared in the body of the block and initialised from a constant array (a script was written to create this array from an image file), or added to the input list, in which case an address output and a data input are created to enable interfacing with memory external to the block. This was most useful with dual-port block-RAM devices, as data could then be written in from another part of the design. In our project we used this to share game map data between the Nios core and the HLS block, and to interface with a framebuffer filled from the HDMI input. For both types of ROMs, the 1-cycle latency from address to data is accounted for and compensated during pipelining.

When written to take advantage of some of the new features, the convolution example above would now look like:

#include "libelastic/convolution.h"
#include "libelastic/math.h"
#include "libelastic/pixel_types.h"

block convolute_greyscale(clock<rising,84e6>, enable, RGB666 input) => (register RGB666 output) {
  int i;
  stream2d<unsigned<6>,5,5,1024> greyscale;
  //Convert pixel to greyscale, and push it into the stream
  greyscale << to_greyscale(input);

  //convolute with horizontal and vertical sobel filters
  signed<16> horiz = convolute_5x5(greyscale, sobel_horizontal_5x5);
  signed<16> vert = convolute_5x5(greyscale, sobel_vertical_5x5);

  unsigned<6> result = clamp((abs(horiz) + abs(vert)) / 32, 63);

  from_greyscale(result, output);
}

The RGB666 type and the corresponding to_greyscale and from_greyscale functions are designed to make manipulating pixels easier, and are included in the library file pixel_types.h. convolution.h contains the convolute_5x5 function for convolution and some standard convolution matrices (Sobel, Gaussian, etc.).

Systems developed using the tool during the course of our project include perspective transformation and marker recognition. The streaming, single-cycle nature of the tool was ideal for our project, an augmented reality headset, as it meant the latency of the processing blocks was very low. The tool processed all these blocks in well under a second, unlike the commercial tool we used which could take several minutes.

Implementation

Being primarily a proof-of-concept and rapid prototyping tool, the first implementation of the HLS tool is not particularly neat. The code is parsed using a simple parser (with expression parsing done using a modified version of the shunting-yard algorithm) and loop unrolling is done at the time of parsing. This converts the block into a large "evaluation tree" which gives each output in terms of the inputs, with tree nodes representing operations, typecasts, conditionals, inputs or constants. In the future, as I develop the code further, I plan to further separate the parser and the evaluation tree generator to enable flexibility in the strategy used to process the design.
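As a rough sketch of what such a structure might look like (the names and fields here are my own illustration, not the actual implementation), each output could be represented by a tree of nodes along these lines:

#include <memory>
#include <string>
#include <vector>

// Hypothetical evaluation tree node; field names are illustrative only.
enum class NodeType {
  Operation, Typecast, Conditional, Input, Constant,
  PipelineRegister  // inserted later, during pipelining (see below)
};

struct EvalNode {
  NodeType type;
  std::string op;                                // e.g. "+", "*", ">>" for Operation nodes
  int width = 0;                                 // result width in bits
  bool isSigned = false;
  long long constValue = 0;                      // used by Constant nodes
  std::string inputName;                         // used by Input nodes
  std::vector<std::shared_ptr<EvalNode>> operands;
};

Each block output - as well as each 'hidden' output described below - would then be the root of one such tree.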

There are a few special cases. In particular the value assigned to a static variable; the value inserted into a stream; and the address for internal ROMs must all be created as 'hidden' outputs and therefore have a tree attached to them. Likewise the current value of static variables, the content of a stream and the data output of a ROM are all created as 'hidden' inputs for analysis purposes; and can therefore appear as tree nodes.

Once these trees have been generated, most of the required processing is fairly simple. It consists first of constant folding, some simple optimisation, and balancing. Division by a constant is also converted to a constant multiplication and a shift. Subsequently, timing analysis and pipelining are performed, as described in detail in the next section. This results in an extra type of tree node, the pipeline register, being inserted into the tree.
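For example, the constant-folding step can be pictured as a simple recursive pass over that structure (again only a sketch built on the hypothetical EvalNode above):

// Fold operations whose operands are all constants into a single Constant node.
std::shared_ptr<EvalNode> foldConstants(std::shared_ptr<EvalNode> node) {
  for (auto &operand : node->operands)
    operand = foldConstants(operand);

  if (node->type == NodeType::Operation && node->operands.size() == 2 &&
      node->operands[0]->type == NodeType::Constant &&
      node->operands[1]->type == NodeType::Constant) {
    long long a = node->operands[0]->constValue;
    long long b = node->operands[1]->constValue;
    auto folded = std::make_shared<EvalNode>();
    folded->type = NodeType::Constant;
    folded->width = node->width;
    folded->isSigned = node->isSigned;
    if (node->op == "+")      folded->constValue = a + b;
    else if (node->op == "-") folded->constValue = a - b;
    else if (node->op == "*") folded->constValue = a * b;
    else return node;                            // operator not handled in this sketch
    return folded;
  }
  return node;
}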

The block is now ready to be output as VHDL. Apart from some special statements needed for inputs, outputs and 'special' variables (static variables, streams and ROMs), generation is fairly simple. The value of each tree item is a VHDL signal, and each tree item can be converted to a suitable VHDL statement assigning a value to this signal. A fairly significant amount of type casting is needed in order to match the behaviour of the HLS types, where operations have a result type such that they will never overflow. All internal calculation is done using the VHDL numeric_std signed and unsigned types; all external IOs use std_logic_vector types.


Timing and Pipelining

One of the main features of the tool is that pipeline registers are automatically inserted as needed in order to ensure that the synthesised design passes timing. Before pipeline register insertion, a timing model must first be used to work out the maximum 'clock-to-valid' delay at each point. This is performed on the tree structure described above, with a delay assigned to the output of each tree node.

The method used to calculate this delay varies depending on the tree item type. For inputs, ROM/stream values, and pipeline register items (the latter primarily applicable when timing analysis is re-run later on), this delay is a constant value not affected by the delay at any input to the tree item. For operation and conditional tree items, the delay at the output is the maximum delay at any of the inputs plus the propagation delay of the operation.

All operations and conditionals represent a simple device such as an adder, multiplier or mux. The propagation delay of these devices was obtained by constructing simple systems in VHDL containing one device surrounded by registers, and running timing analysis in Quartus (the original target of the tool was an Altera Cyclone III FPGA). From this, a model could be built for each type of operation specifying how to calculate its propagation delay based on the widths of its inputs.
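In terms of the hypothetical EvalNode structure above, the delay calculation is a straightforward recursive maximum. The figures used in this sketch are placeholders, not the characterised Cyclone III values:

#include <algorithm>

static const double kRegisterDelayNs = 2.0;      // placeholder clock-to-output delay

// Placeholder width-based model; the real models were fitted to Quartus results.
double propagationDelayNs(const std::shared_ptr<EvalNode> &node) {
  return 1.0 + 0.1 * node->width;
}

double clockToValidDelay(const std::shared_ptr<EvalNode> &node) {
  switch (node->type) {
  case NodeType::Constant:
    return 0.0;
  case NodeType::Input:                          // ROM/stream values behave similarly
  case NodeType::PipelineRegister:
    return kRegisterDelayNs;                     // fixed, independent of the operands
  default: {                                     // operations, conditionals, typecasts
    double worst = 0.0;
    for (const auto &operand : node->operands)
      worst = std::max(worst, clockToValidDelay(operand));
    return worst + propagationDelayNs(node);
  }
  }
}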

Once timing analysis has been performed, pipeline registers must be added as required. The algorithm at present inserts pipeline registers at the inputs to a tree item if the 'clock-to-valid' delay at its output exceeds the allowed timing budget. The timing budget is by default 0.75 or 0.85 times the clock period (0.75 being chosen for higher frequencies); at the 56MHz used in the example above, a factor of 0.75 would correspond to a budget of roughly 13.4ns out of the ~17.9ns clock period. This factor can be adjusted and constant slack added as options. The latency at the output of every tree item is also recorded, and registers will be inserted at the inputs of tree items to ensure that the latency at all of the inputs is the same. This technique also handles the one-cycle read latency of ROM devices: the data output will be recorded as having a latency one cycle higher than the address input, so latency-balancing registers will be added as needed.
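A very rough sketch of this insertion rule, continuing with the structures above (latency balancing and several other details omitted):

// Wrap an operand in a pipeline register node.
std::shared_ptr<EvalNode> registerise(std::shared_ptr<EvalNode> operand) {
  auto reg = std::make_shared<EvalNode>();
  reg->type = NodeType::PipelineRegister;
  reg->width = operand->width;
  reg->isSigned = operand->isSigned;
  reg->operands.push_back(operand);
  return reg;
}

void insertPipelineRegisters(std::shared_ptr<EvalNode> node, double budgetNs) {
  for (auto &operand : node->operands)
    insertPipelineRegisters(operand, budgetNs);
  // If this node's output becomes valid too late in the clock cycle,
  // register all of its inputs so its delay is measured from a register.
  if (clockToValidDelay(node) > budgetNs)
    for (auto &operand : node->operands)
      operand = registerise(operand);
  // Latency balancing (equalising the register count on every input,
  // including the extra cycle for ROM reads) is omitted from this sketch.
}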

Even with more complicated systems, the output of the tool almost always met timing requirements. The downside of the simple model is that it does not take into account how the design will be placed and routed; occasionally, designs with very high utilisation would not meet timing requirements as a result. However, even in these cases the total negative slack (TNS) was small, and the problem could be solved by adjusting the amount of slack the tool adds.


State Machine Generation

As well as the single-cycle synthesis where loops are always unrolled, I also experimented with multi-cycle synthesis. This converted the code to state machines instead of the evaluation tree structure.

The basic strategy was to split the code into basic operations, work out which operations can occur in parallel, and then build each state from operations that can be performed in parallel. There are some additional complications, as for loops and if statements are converted into their own state machines; these are merged into the main state machine unless two or more such statements are chosen to occur in the same state.

In addition to supporting loops that cannot be unrolled, compared to the single-cycle strategy this option allows RAM-type memory, and more than one access to a memory within the block.

In greater detail, the flow of operations to convert the multi-cycle design to a state machine is roughly as follows: split the code into basic operations; determine, from their data dependencies, which operations can occur in parallel; group parallelisable operations together into states; convert for loops and if statements into their own sub-state-machines; and finally merge these into the main state machine.
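As an illustration of the scheduling idea (a sketch of the general approach, not the tool's code), operations can be assigned to states according to their dependency depth:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical basic operation, holding the indices of earlier operations it
// depends on (operations are assumed to be listed in program order).
struct BasicOp {
  std::vector<int> dependencies;
};

// An operation is placed one state after the latest operation it depends on,
// so independent operations end up sharing a state.
std::vector<int> scheduleStates(const std::vector<BasicOp> &ops) {
  std::vector<int> state(ops.size(), 0);
  for (std::size_t i = 0; i < ops.size(); ++i)
    for (int dep : ops[i].dependencies)
      state[i] = std::max(state[i], state[dep] + 1);
  return state;
}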

Although the resulting designs functioned acceptably, the state machine generation was far from optimal in terms of performance - particularly resource usage. Although it could be improved, time constraints mean it will most likely remain nothing more than an experiment for the present. I do, however, hope to incorporate some multi-cycle functionality into the single-cycle generator by allowing individual loops not to be unrolled. This could be used, for example, in a loop that zeros or copies a memory device at the start of a frame.


Lower-Level Logic Synthesis

As well as experimenting with state machine generation, I also experimented with integrating lower-level logic synthesis into the HLS tool. This would allow the design to be synthesised all the way down to device primitives. At minimum these would be LUTs and flip-flops; however, multipliers and fast carry logic can also be synthesised if the target device supports them.

The first step in the process is that the HLS tool is configured to generate a simplified netlist format rather than VHDL, consisting of a series of buses connecting together basic operations such as add, subtract and conditional. In addition, pipeline register insertion is not performed during HLS; instead it is performed later on, during the low-level synthesis.

The low-level synthesis part of the tool then reads in this netlist and converts it to a set of basic logic devices (primarily LUTs at this point). This is done by a function that converts each of the basic operations into a series of LUTs, containing rules on how to synthesise add, multiply, divide, and so on. Device 'technology' classes can also override the synthesis of operations to incorporate dedicated custom logic such as fast carry routing/lookahead or dedicated multipliers. Operations such as division are defined in terms of simpler operations such as add and subtract, so that device-optimised constructs will automatically be used if available.
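One way to picture the technology-override mechanism (the class and method names here are purely my own invention) is a base class providing generic LUT-based implementations that device-specific classes can override:

#include <string>
#include <vector>

// Illustrative netlist types; the real ones would carry LUT masks, nets, etc.
struct Net { std::string name; };
struct Bus { std::vector<Net> bits; };

class Technology {
public:
  virtual ~Technology() = default;
  // Generic adder built from plain LUTs (details omitted in this sketch).
  virtual Bus synthesiseAdd(const Bus &a, const Bus &b) { return Bus{}; }
  // Division is built from simpler operations such as add/subtract, so any
  // device-optimised adder is picked up automatically; real algorithm omitted.
  Bus synthesiseDivide(const Bus &a, const Bus &b) { return synthesiseAdd(a, b); }
};

class CycloneIIITechnology : public Technology {
public:
  // Override to use the device's fast carry chains instead of plain LUTs.
  Bus synthesiseAdd(const Bus &a, const Bus &b) override { return Bus{}; }
};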

The end result of this synthesis is effectively a digital logic circuit. This circuit is then optimised by applying a set of optimisation rules. Application of these rules is repeated in a loop until no further optimisations occur.
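The loop itself is simple; something along these lines, with the individual rules left abstract (the type and function names are illustrative, not the tool's own):

#include <functional>
#include <vector>

struct Circuit { /* LUTs, flip-flops and the nets connecting them */ };

struct Rule {
  std::function<bool(Circuit &)> apply;          // returns true if it rewrote anything
};

// Apply every rule repeatedly until a complete pass makes no further changes.
void optimise(Circuit &circuit, const std::vector<Rule> &rules) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (const auto &rule : rules)
      if (rule.apply(circuit))
        changed = true;
  }
}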

Once the synthesised design has been optimised, timing analysis and pipelining are performed if required. This works using a very similar method to the pipeline register insertion in the HLS described above. The primary difference is that it uses a lower-level timing model based around basic device timings instead of higher-level operations. It is still somewhat limited by the lack of any accounting for variable routing delays, which depend on how place and route treats the design.

The pipelined design is then output as a VHDL netlist, using the device vendor's primitives library. At the moment this output file can get very large (in one case exceeding 100k lines) and slow to process in the vendor's toolchain. As a result I am investigating other vendor-specific output options, such as Xilinx's XDL, as an alternative.

Although this tool can adequately synthesise small designs, up to the scale of ~5000 LUTs (such as the real-time perspective transformation block), its performance is at present not great. Resource usage is around 10-20% greater than using the HLS output with the vendor's synthesis tool; however, timing performance (and thus pipeline latency) can be 50-70% worse. This is most likely because the current strategy does not perform any timing-focused optimisations such as retiming or any kind of balancing.

As well as the above tweaks, a more advanced timing model would be needed for these optimisations, ideally taking into account, at least to an extent, how the design will be placed and therefore what the routing delays will be. At the moment the delays within and between LUTs are considered constant; however, LUT-FF packing is taken into account by removing the routing delay allowance when a flip-flop D input is driven by the output of a LUT.

