a VHDL implementation of squaring

The VHDL module "square" (see symbol) calculates the square of a signed number.

The module uses an optimized multiplication algorithm in order to get a smaller (about 33%) and faster (about 25%) design compared to a multiplication module.

The number of bits of the operand is configured by a generic.

The operand and the result (which is always positive) are numbers in 2's complement format.

The module uses flipflops only for storing the result (and for controlling).

For quick access to the operand bits, the operand is first stored in the result flipflops and

then replaced by the upcoming square bits through shift operations.

The latency of the module can be configured by a generic independently from the width of the operand.

This means, the module is configurable by generics in order

- to fulfill any requirements regarding the number of bits of the operand and
- to fulfill any requirements regarding its latency.

for the generics, as the timing depends on the technology which is used at synthesis.

The module "square" was developed with HDL-SCHEM-Editor.

Port name | Direction | Description |
---|---|---|

res_i | input | asynchronous reset input, 1-active Can be clamped to 0 when g_latency=0. |

clk_i | input | clock input Can be clamped to 0 when g_latency=0. |

start_i | input | This input expects an 1-active impulse of 1 clock cycle width in order to start the calculation. When g_latency=0 then back to back pulses can be used. |

operand_i(g_operand_width-1:0) | input | Signed operand (g_operand_width is a generic). The input must be stable during the calculation. |

ready_o | output | 1-active impulse of 1 clock cycle width, when the calculation is ready (at latency 0 it gets active in the same clock cycle in which start_i gets active). |

square_o(2*g_operand_width-1:0) | output | Signed square (always positive). Valid at ready_o=1. Not stable during calculation. |

Generic name | Minimum Value | Maximum Value | Description |
---|---|---|---|

g_operand_width | 2 | none | Number of bits of the operand The first bit represents the sign as the operand has to be coded in 2's complement. |

g_latency | 0 | none | Latency of the module in clock cycles When g_latency is 0, then the module square is a combinatorial design. |

The module "square" is a hierarchical module, which is built by two submodules.

Submodule name | Functionality |
---|---|

square_step |
The module square_step processes 1 bit of the algorithm. |

square_control |
The square_control modules generates all the control signals which are needed. |

The module "square" was designed to have a faster and smaller solution for calculating the square compared to a multiplication module.

The improvement is based on an algorithm which needs a 1 bit adder for the first step and then 1 additional adder at each next step.

To see the full improvement (about 25% faster and 33% smaller) the generic g_latency must have the value 0 or 1,

because in both cases an individual submodule square_step is instantiated for each step of the algorithm,

so the adder inside this module can have its minimal size.

At the moment g_latency is bigger than 1 a reuse of the square_step modules happens and the improvement will be smaller:

Each square_step module is then used in more than 1 clock cycle and must be able to handle the partial result with the biggest width.

The worst case is at g_latency=g_operand_width, where no improvement compared to a multiplier module remains.

There is no limitation for the generic g_operand_width (except that it must be bigger than 1).

There is also no limitation for the generic g_latency. But this generic determines not only the latency but also

how difficult it will get reaching timing closure: The smaller the value is chosen, the harder it will get reaching timing closure.

If g_latency is equal to g_operand_width, then in each clock cycle 1 step of the algorithm is handled.

If g_latency is smaller than g_operand_width, then in each clock cycle more than step of the algorithm is handled.

How many steps are handled can be calculated by rounding up g_operand_width/g_latency to the next integer.

Be aware that handling more than 1 step of the algorithm in a clock cycle may prevent reaching timing closure.

If g_latency is bigger than g_operand_width, then the number of bits of the operand is internally increased to g_latency and

again in each clock cycle 1 bit of the (extended) operand is handled.

This is not a recommended configuration, as it needs additional gate resources and makes reaching timing closure more difficult.

When the square of a positive number A with the value "abcd" (with a,b,c,d from 0,1) has

to be calculated, we could use the common usual multiplication scheme:

A^2 = d*"0000abcd" +

c*"000abcd0" +

b*"00abcd00" +

a*"0abcd000"

For implementation 4 adders with each 4 bit would be needed.

But this effort can be reduced as the 2 numbers have the same digits.

In order to show a method to reduce the effort, the number A is written as follows:

A = a*2^3 + b*2^2 + c*2^1 + d

When we calculate A^2 the result is (be aware: a^n=a, b^n=b,...):

A^2 = a*2^3 * (a*2^3 + b*2^2 + c*2^1 + d) +

b*2^2 * (a*2^3 + b*2^2 + c*2^1 + d) +

c*2^1 * (a*2^3 + b*2^2 + c*2^1 + d) +

d * (a*2^3 + b*2^2 + c*2^1 + d) =

= a*2^6 + a*b*2^5 + a*c*2^4 + a*d*2^3 +

a*b*2^5 + b*2^4 + b*c*2^3 + b*d*2^2 +

a*c*2^4 + b*c*2^3 + c*2^2 + c*d*2^1 +

a*d*2^3 + b*d*2^2 + c*d*2^1 + d =

= a*2^6 + b*2^4 + c*2^2 + d +

a*b*2^6 + a*c*2^5 + a*d*2^4 + b*c*2^4 + b*d*2^3 + c*d*2^2 =

= a*2^6 + b*2^4 + c*2^2 + d +

a*(b*2^6 + c*2^5 + d*2^4)+ b*(c*2^4 + d*2^3)+ c*d*2^2 =

= a*2^6 + 0*2^5 + b*2^4 + 0*2^3 + c*2^2 + 0*2^1 + d +

a*(b*2^6 + c*2^5 + d*2^4)+

b*(c*2^4 + d*2^3)+

c*(d*2^2) =

= c*"0000d00" +

b*"00cd000" +

a*"bcd0000" +

"a0b0c0d"

In case of 8 bit the scheme looks like:

A^2 = 000000000000000 + Adding 0 makes no sense here, but is needed later.

g*000000000000h00 +

f*0000000000gh000 +

e*00000000fgh0000 +

d*000000efgh00000 +

c*0000defgh000000 +

b*00cdefgh0000000 +

a*bcdefgh00000000 +

a0b0c0d0e0f0g0h

Now the scheme is expanded by showing the intermediate sums:

A^2 = 000000000000000 +

g*000000000000h00 =

---------------

0h00 +

f*0000000000gh000 =

---------------

0ghh00 +

e*00000000fgh0000 =

---------------

sssshh00 +

d*000000efgh00000 =

---------------

sssssshh00 +

c*0000defgh000000 =

---------------

sssssssshh00 +

b*00cdefgh0000000 =

---------------

sssssssssshh00 +

a*bcdefgh00000000 =

---------------

ssssssssssshh00 +

a0b0c0d0e0f0g0h = "spread" operand

---------------

SSSSSSSSSSSSSSS

Now the bits of the "spread" operand are shifted to other places and the last addition is removed:

A^2 = 000000000000g00 + Sum-bits which have reached its end result are named 'S'.

g*000000000000h00 =

---------------

fsS00 +

f*0000000000gh000 =

---------------

essSS00 +

e*00000000fgh0000 =

---------------

dsssSSS00 +

d*000000efgh00000 =

---------------

cssssSSSS00 +

c*0000defgh000000 =

---------------

bsssssSSSSS00 +

b*00cdefgh0000000 =

---------------

assssssSSSSSS00 +

a*bcdefgh0000000h =

---------------

SSSSSSSSSSSSS0h

With the common multiplication scheme, a negative multiplicand (in 2's complement) is handled correctly,

but a negative multiplier, which is handled bit by bit as usual, is too big by 2^n,

and the product must be reduced by 2^n*multiplicand at the end.

But the situation is different in the optimized square scheme. Here both number are interpreted as positive numbers.

If the operand k is a negative integer, the binary value of k (in 2's complement) was calculated by 2^n+k,

where n is the number of bits of the 2's complement. So the square algorithm calculates:

square = (2^n + k)^2 =

= 2^2*n + k*2^(n+1) + k^2

As square has 2*n-1 bits at maximum, the summand 2^2*n can be ignored.

But the summand k*2^(n+1) must be subtracted from the result.

In this 8 bit example the value k*2^(n+1), limited to 15 bit, calculates to:

fix = k*2^(n+1) = abcdefgh * 2^9 = cdefgh000000000

As this value must be subtracted, the 2's complement of the value must be added:

-fix = not (cdefgh000000000) + 1

This new addition, depending on bit a which signals a negative number, is added last to the algorithm:

A^2 = 000000000000g00 + Sum-bits which have reached its end result are named 'S'.

g*000000000000h00 =

---------------

fsS00 +

f*0000000000gh000 =

---------------

essSS00 +

e*00000000fgh0000 =

---------------

dsssSSS00 +

d*000000efgh00000 =

---------------

cssssSSSS00 +

c*0000defgh000000 =

---------------

bsssssSSSSS00 +

b*00cdefgh0000000 =

---------------

assssssSSSSSS00 +

a*bcdefgh0000000h =

---------------

ssssssSSSSSSS0h

a * not(cdefgh) <- CarryIn=a at the least significant bit

---------------

SSSSSSSSSSSSS0h

In each addition parts of the original operand are shifted and added.

In order to avoid shifting the operand to the left, the intermediate sums are shifted right.

Least signficant bits with always value 0 are not shown any more:

A^2 = 000000g + Sum-bits which have reached its end result are named 'S'.

g*000000h =

-------

00000sS -> shift

fsS +

f*00000gh =

-------

0000ssS -> shift

essS +

e*0000fgh =

-------

000sssS -> shift

dsssS +

d*000efgh =

-------

00ssssS -> shift

cssssS +

c*00defgh =

-------

0sssssS -> shift

bsssssS +

b*0cdefgh =

-------

ssssssS -> shift

assssssS +

a*bcdefgh =

-------

ssssssS -> shift

assssssS +

a * not(bcdefgh) = <- CarryIn=a at the adder for the least significant bit

-------

_SSSSSS.......0h The '.'s are filled with the 'S's calculated before.

Source code for HDL-SCHEM-Editor and HDL-FSM-Editor for module "square" and its testbench (Number of downloads =
6 ).

With these files the schematics and the state-diagram of module square can be loaded into HDL-SCHEM-Editor or
HDL-FSM-Editor and can be easily read and modified:

All module VHDL-files of the module "square" (Number of downloads =
7 ).

These files were generated by HDL-SCHEM-Editor and HDL-FSM-Editor:

All testbench VHDL-files of the module "square" (Number of downloads =
6 ).

These files were generated by HDL-SCHEM-Editor and HDL-FSM-Editor:

You should extract all archives into a folder named "square".

Then you must replace "M:/gesicherte Daten/Programmieren/VHDL/square" in all "hdl_editor_designs/*.hse" source files by your path to this directory.

Now you can navigate through the design by HDL-SCHEM-Editor and generate HDL by HDL-SCHEM-Editor for all modules except square_control,

for which the HDL must be generated by HDL-FSM-Editor.

Of course there is only need for generating HDL, if you change something at the modules, because you can find the HDL in VHDL_designs.zip and VHDL_testbenches.zip.

If you want to simulate or modify the modules by HDL-SCHEM-Editor you also must adapt the information in the Control-tab of the toplevel you want to work on.

There you must define a "Compile through hierarchy command", an "Edit command", the path to your HDL-FSM-Editor and a "Working directory".

Version 1.0 (18.10.2024):

- Initial version

If you detect any bugs or have any questions,

please send a mail to "matthias.schweikart@gmx.de".