square: a VHDL module which implements squaring

The VHDL module "square" (see symbol) calculates the square of a signed number.

The module uses an optimized multiplication algorithm in order to get a smaller (about 33%) and faster (about 25%) design compared to a multiplication module.
The number of bits of the operand is configured by a generic.
The operand and the result (which is always positive) are numbers in 2's complement format.
The module uses flipflops only for storing the result (and for controlling).
For quick access to the operand bits, the operand is first stored in the result flipflops and
then replaced by the upcoming square bits through shift operations.
The latency of the module can be configured by a generic independently from the width of the operand.

This means, the module is configurable by generics in order

to fulfill any requirements regarding the number of bits of the operand and
to fulfill any requirements regarding its latency.

But of course there is no guarantee that timing closure can be reached with the selected values
for the generics, as the timing depends on the technology which is used at synthesis.

The module "square" was developed with HDL-SCHEM-Editor.

Ports:

Port name	Direction	Description
res_i	input	asynchronous reset input, 1-active Can be clamped to 0 when g_latency=0.
clk_i	input	clock input Can be clamped to 0 when g_latency=0.
start_i	input	This input expects an 1-active impulse of 1 clock cycle width in order to start the calculation. When g_latency=0 then back to back pulses can be used.
operand_i(g_operand_width-1:0)	input	Signed operand (g_operand_width is a generic). The input must be stable during the calculation.
ready_o	output	1-active impulse of 1 clock cycle width, when the calculation is ready (at latency 0 it gets active in the same clock cycle in which start_i gets active).
square_o(2*g_operand_width-1:0)	output	Signed square (always positive). Valid at ready_o=1. Not stable during calculation.

Generics:

Generic name	Minimum Value	Maximum Value	Description
g_operand_width	2	none	Number of bits of the operand The first bit represents the sign as the operand has to be coded in 2's complement.
g_latency	0	none	Latency of the module in clock cycles When g_latency is 0, then the module square is a combinatorial design.

The module "square" is a hierarchical module, which is built by two submodules.

Submodule name	Functionality
square_step	The module square_step processes 1 bit of the algorithm. The algorithm has as many steps as the operand has data bits. One additional step is needed, which fixes the result if the operand was a negative number. So in sum the algorithm has as many steps as the operand has bits. The module square_step is instantiated as often as steps are processed during 1 clock cycle (which depends from the generic g_latency).
square_control	The square_control modules generates all the control signals which are needed. It enables the internal registers for the intermediate or final results. It identifies the clock period in which the sign bit of the operand is handled. It activates the ready_o output at the end of the calculation.

Submodule name

Functionality

square_step

The module square_step processes 1 bit of the algorithm.
The algorithm has as many steps as the operand has data bits.
One additional step is needed, which fixes the result if the operand was a negative number.
So in sum the algorithm has as many steps as the operand has bits.
The module square_step is instantiated as often as steps are processed during 1 clock cycle (which depends from the generic g_latency).

square_control

The square_control modules generates all the control signals which are needed.
It enables the internal registers for the intermediate or final results.
It identifies the clock period in which the sign bit of the operand is handled.
It activates the ready_o output at the end of the calculation.

The module "square" was designed to have a faster and smaller solution for calculating the square compared to a multiplication module.
The improvement is based on an algorithm which needs a 1 bit adder for the first step and then 1 additional adder at each next step.
To see the full improvement (about 25% faster and 33% smaller) the generic g_latency must have the value 0 or 1,
because in both cases an individual submodule square_step is instantiated for each step of the algorithm,
so the adder inside this module can have its minimal size.
At the moment g_latency is bigger than 1 a reuse of the square_step modules happens and the improvement will be smaller:
Each square_step module is then used in more than 1 clock cycle and must be able to handle the partial result with the biggest width.
The worst case is at g_latency=g_operand_width, where no improvement compared to a multiplier module remains.

There is no limitation for the generic g_operand_width (except that it must be bigger than 1).

There is also no limitation for the generic g_latency. But this generic determines not only the latency but also
how difficult it will get reaching timing closure: The smaller the value is chosen, the harder it will get reaching timing closure.

If g_latency is equal to g_operand_width, then in each clock cycle 1 step of the algorithm is handled.

If g_latency is smaller than g_operand_width, then in each clock cycle more than step of the algorithm is handled.
How many steps are handled can be calculated by rounding up g_operand_width/g_latency to the next integer.
Be aware that handling more than 1 step of the algorithm in a clock cycle may prevent reaching timing closure.

If g_latency is bigger than g_operand_width, then the number of bits of the operand is internally increased to g_latency and
again in each clock cycle 1 bit of the (extended) operand is handled.
This is not a recommended configuration, as it needs additional gate resources and makes reaching timing closure more difficult.

When the square of a positive number A with the value "abcd" (with a,b,c,d from 0,1) has
to be calculated, we could use the common usual multiplication scheme:

A^2 = d*"0000abcd" +
c*"000abcd0" +
b*"00abcd00" +
a*"0abcd000"

For implementation 4 adders with each 4 bit would be needed.
But this effort can be reduced as the 2 numbers have the same digits.
In order to show a method to reduce the effort, the number A is written as follows:

A = a*2^3 + b*2^2 + c*2^1 + d

When we calculate A^2 the result is (be aware: a^n=a, b^n=b,...):

A^2 = a*2^3 * (a*2^3 + b*2^2 + c*2^1 + d) +
b*2^2 * (a*2^3 + b*2^2 + c*2^1 + d) +
c*2^1 * (a*2^3 + b*2^2 + c*2^1 + d) +
d * (a*2^3 + b*2^2 + c*2^1 + d) =

= a*2^6 + a*b*2^5 + a*c*2^4 + a*d*2^3 +
a*b*2^5 + b*2^4 + b*c*2^3 + b*d*2^2 +
a*c*2^4 + b*c*2^3 + c*2^2 + c*d*2^1 +
a*d*2^3 + b*d*2^2 + c*d*2^1 + d =

= a*2^6 + b*2^4 + c*2^2 + d +
a*b*2^6 + a*c*2^5 + a*d*2^4 + b*c*2^4 + b*d*2^3 + c*d*2^2 =

= a*2^6 + b*2^4 + c*2^2 + d +
a*(b*2^6 + c*2^5 + d*2^4)+ b*(c*2^4 + d*2^3)+ c*d*2^2 =

= a*2^6 + 0*2^5 + b*2^4 + 0*2^3 + c*2^2 + 0*2^1 + d +
a*(b*2^6 + c*2^5 + d*2^4)+
b*(c*2^4 + d*2^3)+
c*(d*2^2) =

= c*"0000d00" +
b*"00cd000" +
a*"bcd0000" +
"a0b0c0d"

In case of 8 bit the scheme looks like:

A^2 = 000000000000000 + Adding 0 makes no sense here, but is needed later.
g*000000000000h00 +
f*0000000000gh000 +
e*00000000fgh0000 +
d*000000efgh00000 +
c*0000defgh000000 +
b*00cdefgh0000000 +
a*bcdefgh00000000 +
a0b0c0d0e0f0g0h

Now the scheme is expanded by showing the intermediate sums:

A^2 = 000000000000000 +
g*000000000000h00 =
---------------
0h00 +
f*0000000000gh000 =
---------------
0ghh00 +
e*00000000fgh0000 =
---------------
sssshh00 +
d*000000efgh00000 =
---------------
sssssshh00 +
c*0000defgh000000 =
---------------
sssssssshh00 +
b*00cdefgh0000000 =
---------------
sssssssssshh00 +
a*bcdefgh00000000 =
---------------
ssssssssssshh00 +
a0b0c0d0e0f0g0h = "spread" operand
---------------
SSSSSSSSSSSSSSS

Now the bits of the "spread" operand are shifted to other places and the last addition is removed:

A^2 = 000000000000g00 + Sum-bits which have reached its end result are named 'S'.
g*000000000000h00 =
---------------
fsS00 +
f*0000000000gh000 =
---------------
essSS00 +
e*00000000fgh0000 =
---------------
dsssSSS00 +
d*000000efgh00000 =
---------------
cssssSSSS00 +
c*0000defgh000000 =
---------------
bsssssSSSSS00 +
b*00cdefgh0000000 =
---------------
assssssSSSSSS00 +
a*bcdefgh0000000h =
---------------
SSSSSSSSSSSSS0h

With the common multiplication scheme, a negative multiplicand (in 2's complement) is handled correctly,
but a negative multiplier, which is handled bit by bit as usual, is too big by 2^n,
and the product must be reduced by 2^n*multiplicand at the end.
But the situation is different in the optimized square scheme. Here both number are interpreted as positive numbers.
If the operand k is a negative integer, the binary value of k (in 2's complement) was calculated by 2^n+k,
where n is the number of bits of the 2's complement. So the square algorithm calculates:

square = (2^n + k)^2 =
= 2^2*n + k*2^(n+1) + k^2

As square has 2*n-1 bits at maximum, the summand 2^2*n can be ignored.
But the summand k*2^(n+1) must be subtracted from the result.

In this 8 bit example the value k*2^(n+1), limited to 15 bit, calculates to:

fix = k*2^(n+1) = abcdefgh * 2^9 = cdefgh000000000

As this value must be subtracted, the 2's complement of the value must be added:

-fix = not (cdefgh000000000) + 1

This new addition, depending on bit a which signals a negative number, is added last to the algorithm:

A^2 = 000000000000g00 + Sum-bits which have reached its end result are named 'S'.
g*000000000000h00 =
---------------
fsS00 +
f*0000000000gh000 =
---------------
essSS00 +
e*00000000fgh0000 =
---------------
dsssSSS00 +
d*000000efgh00000 =
---------------
cssssSSSS00 +
c*0000defgh000000 =
---------------
bsssssSSSSS00 +
b*00cdefgh0000000 =
---------------
assssssSSSSSS00 +
a*bcdefgh0000000h =
---------------
ssssssSSSSSSS0h
a * not(cdefgh) <- CarryIn=a at the least significant bit
---------------
SSSSSSSSSSSSS0h

In each addition parts of the original operand are shifted and added.
In order to avoid shifting the operand to the left, the intermediate sums are shifted right.
Least signficant bits with always value 0 are not shown any more:

A^2 = 000000g + Sum-bits which have reached its end result are named 'S'.
g*000000h =
-------
00000sS -> shift
fsS +
f*00000gh =
-------
0000ssS -> shift
essS +
e*0000fgh =
-------
000sssS -> shift
dsssS +
d*000efgh =
-------
00ssssS -> shift
cssssS +
c*00defgh =
-------
0sssssS -> shift
bsssssS +
b*0cdefgh =
-------
ssssssS -> shift
assssssS +
a*bcdefgh =
-------
ssssssS -> shift
assssssS +
a * not(bcdefgh) = <- CarryIn=a at the adder for the least significant bit
-------
_SSSSSS.......0h The '.'s are filled with the 'S's calculated before.

Source code for HDL-SCHEM-Editor and HDL-FSM-Editor for module "square" and its testbench (Number of downloads = 252 ).
With these files the schematics and the state-diagram of module square can be loaded into HDL-SCHEM-Editor or HDL-FSM-Editor and can be easily read and modified:

hdl_editor_designs.zip

All module VHDL-files of the module "square" (Number of downloads = 254 ).
These files were generated by HDL-SCHEM-Editor and HDL-FSM-Editor:

VHDL_designs.zip

All testbench VHDL-files of the module "square" (Number of downloads = 250 ).
These files were generated by HDL-SCHEM-Editor and HDL-FSM-Editor:

VHDL_testbenches.zip

Relocation hints:

You should extract all archives into a folder named "square".

Then you should load the toplevel (probably the testbench) into HDL-SCHEM-Editor.
When you navigate through the design hierarchy by a double click at each symbol,
HDL-SCHEM-Editor will find the submodules on your disk and ask if it can replace
the original path to the submodule by the new one at your disk.
After storing the changed modules the relocation of the source files is ready
(instead you could replace "M:/gesicherte Daten/Programmieren/VHDL/square" in all
"hdl_editor_designs/*.hse" source files by your path to this directory with your editor).

Now you can navigate through the design by HDL-SCHEM-Editor and generate HDL by HDL-SCHEM-Editor for
all modules except square_control, for which the HDL must be generated by HDL-FSM-Editor.
Of course there is only need for generating HDL, if you change something at the modules, because you can find the HDL in VHDL_designs.zip and VHDL_testbenches.zip.

If you want to simulate or modify the modules by HDL-SCHEM-Editor you also must adapt the information in the Control-tab of the toplevel you want to work on.
There you must define a "Compile through hierarchy command", an "Edit command", the path to your HDL-FSM-Editor and a "Working directory".

Change log:

Version 1.0 (18.10.2024):

Initial version

If you detect any bugs or have any questions,
please send a mail to "matthias.schweikart@gmx.de".

The module "square"a VHDL implementation of squaring

Ports:

Generics:

Relocation hints:

Change log:

The module "square"
a VHDL implementation of squaring