Experiments with FPGA Logic Synthesis - Project ElasticC

Introduction

An interest in FPGA synthesis began with a university group project entitled "Realtime Image Processing using an FPGA". For this we were supposed to use a commercial HLS tool, Catapult C. However, as soon as I started using it, I felt it was clunky and sub-optimal for simple streaming single-cycle applications like our project. As a result I decided I would have a go at building a simple tool to synthesise a C-like language to RTL, automatically inserting pipeline registers as required to pass timing for a given operating frequency (the idea of building this was not taken particularly seriously by other members of the group at first). In its simplest form it would be designed for applications where all loops can be unrolled at compile time - this was not actually a major limitation for the purposes of our project. This tool sped up the project itself significantly, as it allowed ideas to be prototyped quickly and easily.

After developing an initial prototype for the single-cycle case, I continued to add features to it and to experiment further. This included functionality to convert designs where loops couldn't be unrolled into state machines. I then experimented with synthesising simple designs down to the level of device primitives - LUTs, single flip-flops, and fast carry logic/multipliers if supported by the target device.

The new, neater codebase - which is not yet as capable as what is described here - is being actively developed and is published as open source (MIT License) on GitHub.


A C-like Language for basic HLS

Language Overview

The initial language was designed to closely follow C syntax, with some cues also taken from C++, but modified and extended to better support the use cases we were interested in. Unlike other, more advanced HLS tools, the priority was ease of use and tidiness rather than strict compatibility with existing C code.

A typical block - which synthesises to a VHDL entity - is this example of converting a pixel stream to greyscale then performing horizontal and vertical Sobel edge detection on it. This would be converted to a pipelined VHDL entity which processes one new pixel each clock cycle.

block convolute_greyscale(clock<rising,56e6>, enable, unsigned<6>[3] input) => (register unsigned<6>[3] output) {
  int i = 0, j = 0, k = 0;

  stream2d<unsigned<6>,5,5,1024> greyscale;
  //Convert pixel to greyscale, and push it into the stream
  greyscale << ((input[0] + input[1] + input[2]) / 3);

  //5x5 sobel matrices
  signed<5>[5][5] m1 = {{1, 2, 0, -2, -1},
                        {4, 8, 0, -8, -4},
                        {6, 12, 0, -12, -6},
                        {4, 8, 0, -8, -4},
                        {1, 2, 0, -2, -1}};

  signed<5>[5][5] m2 = {{-1, -4, -6, -4, -1},
                        {-2, -8, -12, -8, -2},
                        {0, 0, 0, 0, 0},
                        {2, 8, 12, 8, 2},
                        {1, 4, 6, 4, 1}};
  //convolute with horizontal and vertical sobel filters
  signed<16> sum1 = 0, sum2 = 0;
  for(j = 0; j < 5; j++) {
    for(k = 0; k < 5; k++) {
      sum1 += greyscale[j][k] * m1[j][k];
      sum2 += greyscale[j][k] * m2[j][k];
    }
  }
  //take the absolute value of both sums
  if(sum1 < 0) {
    sum1 = 0-sum1;
  }
  if(sum2 < 0) {
    sum2 = 0-sum2;
  }
  //add the absolute values of the two convolution results,
  //and normalise by dividing by 32
  unsigned<9> sum = (sum1 + sum2) >> 5;
  unsigned<6> result;
  //clamp the output value to 63
  if(sum > 63) {
    result = 63;
  } else {
    result = sum;
  }

  for(i = 0; i < 3; i++) {
    output[i] = result;
  }
}

The first thing to note is the block header. This specifies that an entity entitled convolute_greyscale should be created, with a 56MHz clock input and rising-edge-triggered registers. It also specifies that a special enable input should be created for registers and streams (these are described in detail later). There is then the pixel data input itself, specified as an array of 3 6-bit unsigned integers and entitled input. Any other data inputs would be added to this first argument list. The second argument list, following the =>, is the list of outputs. In this case there is only one output, entitled output and also an array of 3 6-bit unsigned integers. The register keyword specifies that an output register should be added. In order to better integrate with many existing designs, arrays in the I/O lists are packed into a single std_logic_vector (in this case 18 bits wide), with the most significant bits corresponding to the highest array index.
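As a rough illustration of this packing convention, the C++ sketch below packs an unsigned<6>[3] value into an 18-bit word in the same bit order (this is my own illustration, not code from the tool):

#include <array>
#include <cstdint>

// Illustrative only: pack an unsigned<6>[3] port into the 18-bit
// std_logic_vector equivalent, with the highest array index in the MSBs.
uint32_t pack_port(const std::array<uint8_t, 3> &value) {
  uint32_t packed = 0;
  for (int i = 2; i >= 0; --i) {
    packed = (packed << 6) | (value[i] & 0x3Fu);  // value[2] ends up in bits 17..12
  }
  return packed;
}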

The next important feature to note is the stream2d data type, which was added in order to simplify many image processing tasks. It creates a 2D moving window of a given size, with a given number of pixels between lines. In this example, greyscale image data is shifted into the stream2d window every clock cycle when enable is high (the block is externally wired such that enable is low during blanking periods). The elements of the window can be accessed like a normal 2D array.
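Conceptually, stream2d<unsigned<6>,5,5,1024> behaves like a line-buffered sliding window over the pixel stream. The C++ sketch below is my own model of those semantics (the indexing convention is an assumption on my part), not the tool's implementation:

#include <cstdint>
#include <vector>

// Model of a 5x5 window over a stream with 1024 pixels between lines.
class Stream2D {
public:
  Stream2D() : buf(4 * 1024 + 5, 0) {}       // last 4 full lines plus 5 pixels
  void push(uint8_t pixel) {                 // corresponds to 'greyscale << value'
    buf.erase(buf.begin());                  // discard the oldest pixel
    buf.push_back(pixel);                    // append the newest
  }
  uint8_t window(int row, int col) const {   // corresponds to 'greyscale[row][col]'
    return buf[row * 1024 + col];            // rows are one line (1024 pixels) apart
  }
private:
  std::vector<uint8_t> buf;
};

In hardware a window like this would typically be built from shift registers and line buffers rather than a software-style queue; the sketch only illustrates which pixels the window indices refer to.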

Other than that, the code closely follows C; the main differences being the arbitrary-width signed and unsigned types, and a few workarounds because I didn't have time to write a fully compliant parser (such as a lack of support for unary minus, with 0- used instead). The other important feature of note is that integer operations always produce a result with a width sufficiently large that it is guaranteed not to overflow (for example, adding two unsigned<6> values produces an unsigned<7> result), minimising the amount of casting needed.

Some additional features were added later on in order to further neaten simple image processing tasks. The first implemented was functions, with operator overloading and an auto type enabling simple generic programming. With the addition of #include support, a simple library of common image processing functions could be created. struct support was also added, with structure types in input/output lists being packed into std_logic_vector ports in a similar way to arrays. Thus a pixel, for example, could be represented using the RGB666 type, which has 6 bits per channel with R in the most significant bits. This would become an 18-bit wide port.

static variables were also added. Declaring a variable as static means it is preserved between clock cycles, useful for tasks such as finding average or min/max values. A reset input can also be added which resets static variables to the value in their initialiser.

Support for ROMs was also added, primarily to enable rendering fonts or textures. ROMs could either be declared in the body of the block and initialised from a constant array (a script was written to create this array from an image file), or added to the input list, in which case an address output and a data input are created to enable interfacing with memory external to the block. This was most useful with dual-port block-RAM devices, as data could then be written in from another part of the design. In our project we used this to share game map data between the Nios core and the HLS block, and to interface with a framebuffer filled from the HDMI input. For both types of ROMs, the 1-cycle latency from address to data is accounted for and compensated during pipelining.

When written to take advantage of some of the new features, the convolution example above would now look like:

#include "libelastic/convolution.h"
#include "libelastic/math.h"
#include "libelastic/pixel_types.h"

block convolute_greyscale(clock<rising,84e6>, enable, RGB666 input) => (register RGB666 output) {
  int i;
  stream2d<unsigned<6>,5,5,1024> greyscale;
  //Convert pixel to greyscale, and push it into the stream
  greyscale << to_greyscale(input);

  //convolute with horizontal and vertical sobel filters
  signed<16> horiz = convolute_5x5(greyscale, sobel_horizontal_5x5);
  signed<16> vert = convolute_5x5(greyscale, sobel_vertical_5x5);

  unsigned<6> result = clamp((abs(horiz) + abs(vert)) / 32, 63);

  from_greyscale(result, output);
}

The RGB666 type and the corresponding to_greyscale and from_greyscale functions are designed to make manipulating pixels easier, and are included in the library file pixel_types.h. convolution.h contains the convolute_5x5 function for convolution and some standard convolution matrices (Sobel, Gaussian, etc.).

Systems developed using the tool during the course of our project include perspective transformation and marker recognition. The streaming, single-cycle nature of the tool was ideal for our project, an augmented reality headset, as it meant the latency of the processing blocks was very low. The tool processed all these blocks in well under a second, unlike the commercial tool we used which could take several minutes.

Implementation

Being primarily a proof-of-concept and rapid prototyping tool, the first implementation of the HLS tool is not particularly neat. The code is parsed using a simple parser (with expression parsing done using a modified version of the shunting-yard algorithm) and loop unrolling is done at the time of parsing. This converts the block into a large "evaluation tree" which gives each output in terms of the inputs, with tree nodes representing operations, typecasts, conditionals, inputs or constants. In the future, as I develop the code further, I plan to further separate the parser and the evaluation tree generator to enable flexibility in the strategy used to process the design.
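As a rough sketch of what such a structure might look like (the names and fields here are my own illustration, not the actual implementation), each output could be represented by a tree of nodes along these lines:

#include <memory>
#include <string>
#include <vector>

// Hypothetical evaluation tree node; field names are illustrative only.
enum class NodeType {
  Operation, Typecast, Conditional, Input, Constant,
  PipelineRegister  // inserted later, during pipelining (see below)
};

struct EvalNode {
  NodeType type;
  std::string op;                                // e.g. "+", "*", ">>" for Operation nodes
  int width = 0;                                 // result width in bits
  bool isSigned = false;
  long long constValue = 0;                      // used by Constant nodes
  std::string inputName;                         // used by Input nodes
  std::vector<std::shared_ptr<EvalNode>> operands;
};

Each block output - as well as each 'hidden' output described below - would then be the root of one such tree.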

There are a few special cases. In particular the value assigned to a static variable; the value inserted into a stream; and the address for internal ROMs must all be created as 'hidden' outputs and therefore have a tree attached to them. Likewise the current value of static variables, the content of a stream and the data output of a ROM are all created as 'hidden' inputs for analysis purposes; and can therefore appear as tree nodes.

Once these trees have been generated, most of the required processing is fairly simple. It consists first of constant folding, some simple optimisation, and balancing. Division by a constant is also converted to a constant multiplication and a shift. Subsequently, timing analysis and pipelining are performed, as described in detail in the next section. This results in an extra type of tree node, the pipeline register, being inserted into the tree.
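For example, the constant-folding step can be pictured as a simple recursive pass over that structure (again only a sketch built on the hypothetical EvalNode above):

// Fold operations whose operands are all constants into a single Constant node.
std::shared_ptr<EvalNode> foldConstants(std::shared_ptr<EvalNode> node) {
  for (auto &operand : node->operands)
    operand = foldConstants(operand);

  if (node->type == NodeType::Operation && node->operands.size() == 2 &&
      node->operands[0]->type == NodeType::Constant &&
      node->operands[1]->type == NodeType::Constant) {
    long long a = node->operands[0]->constValue;
    long long b = node->operands[1]->constValue;
    auto folded = std::make_shared<EvalNode>();
    folded->type = NodeType::Constant;
    folded->width = node->width;
    folded->isSigned = node->isSigned;
    if (node->op == "+")      folded->constValue = a + b;
    else if (node->op == "-") folded->constValue = a - b;
    else if (node->op == "*") folded->constValue = a * b;
    else return node;                            // operator not handled in this sketch
    return folded;
  }
  return node;
}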

The block is now ready to be output as VHDL. Apart from some special statements needed for inputs, outputs and 'special' variables (static variables, streams and ROMs), generation is fairly simple. The value of each tree item is a VHDL signal, and each tree item can be converted to a suitable VHDL statement assigning a value to this signal. A fairly significant amount of type casting is needed in order to match the behaviour of the HLS types, where operations have a result type such that they will never overflow. All internal calculation is done using the VHDL numeric_std signed and unsigned types; all external IOs use std_logic_vector types.


Timing and Pipelining

One of the main features of the tool is that pipeline registers are automatically inserted as needed in order to ensure that the synthesised design passes timing. Before pipeline register insertion, a timing model must first be used to work out the maximum 'clock-to-valid' delay at each point. This is performed on the tree structure described above, with a delay assigned to the output of each tree node.

The method used to calculate this delay varies depending on the tree item type. For inputs, ROM/stream values, and pipeline register items (the latter primarily applicable when timing analysis is re-run later on), this delay is a constant value not affected by the delay at any input to the tree item. For operation and conditional tree items, the delay at the output is the maximum delay at any of the inputs plus the propagation delay of the operation.

All operations and conditionals represent a simple device such as an adder, multiplier or mux. The propagation delay of these devices was obtained by constructing simple systems in VHDL containing one device surrounded by registers, and running timing analysis in Quartus (the original target of the tool was an Altera Cyclone III FPGA). From this, a model could be built for each type of operation specifying how to calculate its propagation delay based on the widths of its inputs.
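In terms of the hypothetical EvalNode structure above, the delay calculation is a straightforward recursive maximum. The figures used in this sketch are placeholders, not the characterised Cyclone III values:

#include <algorithm>

static const double kRegisterDelayNs = 2.0;      // placeholder clock-to-output delay

// Placeholder width-based model; the real models were fitted to Quartus results.
double propagationDelayNs(const std::shared_ptr<EvalNode> &node) {
  return 1.0 + 0.1 * node->width;
}

double clockToValidDelay(const std::shared_ptr<EvalNode> &node) {
  switch (node->type) {
  case NodeType::Constant:
    return 0.0;
  case NodeType::Input:                          // ROM/stream values behave similarly
  case NodeType::PipelineRegister:
    return kRegisterDelayNs;                     // fixed, independent of the operands
  default: {                                     // operations, conditionals, typecasts
    double worst = 0.0;
    for (const auto &operand : node->operands)
      worst = std::max(worst, clockToValidDelay(operand));
    return worst + propagationDelayNs(node);
  }
  }
}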

Once timing analysis has been performed, pipeline registers must be added as required. The algorithm at present inserts pipeline registers at the inputs to a tree item if the 'clock-to-valid' delay at its output exceeds the allowed timing budget. The timing budget is by default 0.75 or 0.85 times the clock period (0.75 being chosen for higher frequencies); at the 56MHz used in the example above, a factor of 0.75 would correspond to a budget of roughly 13.4ns out of the ~17.9ns clock period. This factor can be adjusted and constant slack added as options. The latency at the output of every tree item is also recorded, and registers will be inserted at the inputs of tree items to ensure that the latency at all of the inputs is the same. This technique also handles the one-cycle read latency of ROM devices: the data output will be recorded as having a latency one cycle higher than the address input, so latency-balancing registers will be added as needed.
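A very rough sketch of this insertion rule, continuing with the structures above (latency balancing and several other details omitted):

// Wrap an operand in a pipeline register node.
std::shared_ptr<EvalNode> registerise(std::shared_ptr<EvalNode> operand) {
  auto reg = std::make_shared<EvalNode>();
  reg->type = NodeType::PipelineRegister;
  reg->width = operand->width;
  reg->isSigned = operand->isSigned;
  reg->operands.push_back(operand);
  return reg;
}

void insertPipelineRegisters(std::shared_ptr<EvalNode> node, double budgetNs) {
  for (auto &operand : node->operands)
    insertPipelineRegisters(operand, budgetNs);
  // If this node's output becomes valid too late in the clock cycle,
  // register all of its inputs so its delay is measured from a register.
  if (clockToValidDelay(node) > budgetNs)
    for (auto &operand : node->operands)
      operand = registerise(operand);
  // Latency balancing (equalising the register count on every input,
  // including the extra cycle for ROM reads) is omitted from this sketch.
}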

Even with more complicated systems, the output of the tool almost always met timing requirements. The downside of the simple model is that it does not take into account how the design will be placed and routed; occasionally, designs with very high utilisation would not meet timing requirements as a result. However, even in these cases the total negative slack (TNS) was small, and the problem could be solved by adjusting the amount of slack the tool adds.


State Machine Generation

As well as the single-cycle synthesis where loops are always unrolled, I also experimented with multi-cycle synthesis. This converted the code to state machines instead of the evaluation tree structure.

The basic strategy was to split the code into basic operations, work out which operations can occur in parallel, and then build each state from operations that can be performed in parallel. There are some additional complications, as for loops and if statements are converted into their own state machines; these are merged into the main state machine unless two or more such statements are chosen to occur in the same state.

In addition to supporting loops that cannot be unrolled, compared to the single-cycle strategy this option allows RAM-type memory, and more than one access to a memory within the block.

In greater detail, the flow of operations to convert the multi-cycle design to a state machine is roughly as follows: split the code into basic operations; determine, from their data dependencies, which operations can occur in parallel; group parallelisable operations together into states; convert for loops and if statements into their own sub-state-machines; and finally merge these into the main state machine.
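As an illustration of the scheduling idea (a sketch of the general approach, not the tool's code), operations can be assigned to states according to their dependency depth:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical basic operation, holding the indices of earlier operations it
// depends on (operations are assumed to be listed in program order).
struct BasicOp {
  std::vector<int> dependencies;
};

// An operation is placed one state after the latest operation it depends on,
// so independent operations end up sharing a state.
std::vector<int> scheduleStates(const std::vector<BasicOp> &ops) {
  std::vector<int> state(ops.size(), 0);
  for (std::size_t i = 0; i < ops.size(); ++i)
    for (int dep : ops[i].dependencies)
      state[i] = std::max(state[i], state[dep] + 1);
  return state;
}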

Although the resulting designs functioned acceptably, the state machine generation was far from optimal in terms of performance - particularly resource usage. Although it could be improved, time constraints mean it will most likely remain nothing more than an experiment for the present. I do, however, hope to incorporate some multi-cycle functionality into the single-cycle generator by allowing individual loops not to be unrolled. This could be used, for example, in a loop that zeros or copies a memory device at the start of a frame.


Lower-Level Logic Synthesis

As well as experimenting with state machine generation, I also experimented with integrating lower-level logic synthesis into the HLS tool. This would allow the design to be synthesised all the way down to device primitives. At minimum these would be LUTs and flip-flops; however, multipliers and fast carry logic can also be synthesised if the target device supports them.

The first step in the process is that the HLS tool is configured to generate a simplified netlist format rather than VHDL, consisting of a series of buses connecting together basic operations such as add, subtract and conditional. In addition, pipeline register insertion is not performed during HLS; instead it is performed later on, during the low-level synthesis.

The low-level synthesis part of the tool then reads in this netlist and converts it to a set of basic logic devices (primarily LUTs at this point). This is done by a function that converts each of the basic operations into a series of LUTs, containing rules on how to synthesise add, multiply, divide, and so on. Device 'technology' classes can also override the synthesis of operations to incorporate dedicated custom logic such as fast carry routing/lookahead or dedicated multipliers. Operations such as division are defined in terms of simpler operations such as add and subtract, so that device-optimised constructs will automatically be used if available.
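One way to picture the technology-override mechanism (the class and method names here are purely my own invention) is a base class providing generic LUT-based implementations that device-specific classes can override:

#include <string>
#include <vector>

// Illustrative netlist types; the real ones would carry LUT masks, nets, etc.
struct Net { std::string name; };
struct Bus { std::vector<Net> bits; };

class Technology {
public:
  virtual ~Technology() = default;
  // Generic adder built from plain LUTs (details omitted in this sketch).
  virtual Bus synthesiseAdd(const Bus &a, const Bus &b) { return Bus{}; }
  // Division is built from simpler operations such as add/subtract, so any
  // device-optimised adder is picked up automatically; real algorithm omitted.
  Bus synthesiseDivide(const Bus &a, const Bus &b) { return synthesiseAdd(a, b); }
};

class CycloneIIITechnology : public Technology {
public:
  // Override to use the device's fast carry chains instead of plain LUTs.
  Bus synthesiseAdd(const Bus &a, const Bus &b) override { return Bus{}; }
};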

The end result of this synthesis is effectively a digital logic circuit. This circuit is then optimised by applying a set of optimisation rules. Application of these rules is repeated in a loop until no further optimisations occur.
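The loop itself is simple; something along these lines, with the individual rules left abstract (the type and function names are illustrative, not the tool's own):

#include <functional>
#include <vector>

struct Circuit { /* LUTs, flip-flops and the nets connecting them */ };

struct Rule {
  std::function<bool(Circuit &)> apply;          // returns true if it rewrote anything
};

// Apply every rule repeatedly until a complete pass makes no further changes.
void optimise(Circuit &circuit, const std::vector<Rule> &rules) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (const auto &rule : rules)
      if (rule.apply(circuit))
        changed = true;
  }
}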

Once the synthesised design has been optimised, timing analysis and pipelining are performed if required. This works using a very similar method to the pipeline register insertion in the HLS described above. The primary difference is that it uses a lower-level timing model based around basic device timings instead of higher-level operations. It is still somewhat limited by the lack of any accounting for variable routing delays, which depend on how place and route treats the design.

The pipelined design is then output as a VHDL netlist, using the device vendor's primitives library. At the moment this output file can get very large (in one case exceeding 100k lines) and slow to process in the vendor's toolchain. As a result I am investigating other vendor-specific output options, such as Xilinx's XDL, as an alternative.

Although this tool can adequately synthesise small designs, up to the scale of ~5000 LUTs (such as the real-time perspective transformation block), its performance is at present not great. Resource usage is around 10-20% greater than using the HLS output with the vendor's synthesis tool; however, timing performance (and thus pipeline latency) can be 50-70% worse. This is most likely because the current strategy does not perform any timing-focused optimisations such as retiming or any kind of balancing.

As well as the above tweaks, a more advanced timing model would be needed for these optimisations, ideally taking into account, at least to an extent, how the design will be placed and therefore what the routing delays will be. At the moment the delays within and between LUTs are considered constant; however, LUT-FF packing is taken into account by removing the routing delay allowance when a flip-flop D input is driven by the output of a LUT.

