square_root: a VHDL module which implements a root calculation

The VHDL module "square_root" (see symbol) calculates the root of a radicand by an exact algorithm.

The module is configurable by generics in order

to fulfill any requirements regarding the number of bits of the radicand and
to fulfill any requirements regarding its latency.

But of course there is no guarantee that timing closure can be reached with the selected values
for the generics, as the timing depends on the technology which is used at synthesis.

The differences to the module "cordic_square_root" (also available at this website) are (when both modules create the same number of root bits):

The number of "cordic_square_root" micro-rotations is only half the number of "square_root" iterations.
When 1 micro-rotation/iteration is done per clock cycle then "cordic_square_root" only needs half the time of "square_root".
But if the root bits after the binary point are not needed then "square_root" is as fast as or
faster than "cordic_square_root" (faster because it works directly on the radicand).
The internal register width of "cordic_square_root" is only half the internal register width of "square_root"
(register-width "cordic_square_root"=g_radicand_width, register-width "square_root"=g_radicand_width+g_additional_bits).
"square_root" can immediately work with the radicand, while "cordic_square_root" must shift and convert the radicand into the input operands of the Hyperbolic Cordic algorithm.
"square_root" can immediately use the calculated root, while "cordic_square_root" must shift the output of the Hyperbolic Cordic algorithm to get the root.
The accuracy of "square_root" is better than the accuracy of "cordic_square_root" (see tab "Accuracy of outputs").
This is because "square_root" uses an exact algorithm while "cordic_square_root" uses an approximation algorithm.

The module "square_root" was developed with HDL-SCHEM-Editor.

Ports:

Port name	Direction	Description
res_i	input	asynchronous reset input, 1-active
clk_i	input	clock input
start_i	input	This input expects an 1-active impulse of 1 clock cycle width in order to start the calculation.
radicand_i(g_radicand_width-1:0)	input	Unsigned radicand (g_radicand_width is a generic) The input value is latched at start=1.
ready_o	output	1-active impulse of 1 clock cycle width, when the calculation is ready (at latency 0 it gets active in the same clock cycle in which start_i gets active).
root_o (g_radicand_width/2-1+ g_additional_root_bits+g_radicand_width mod 2:0)	output	Unsigned root (with g_additional_root_bits beneath the binary point) Valid at ready_o=1 Not stable between 2 impulses at ready_o

Generics:

Generic name	Minimum Value	Maximum Value	Description
g_radicand_width	1	n/a	Number of bits of the radicand
g_additional_root_bits	0	n/a	Number of additional root bits after the binary point
g_latency	0	n/a	Latency of the module

The module "square_root" is a hierarchical module, which is built by 2 submodules.

Submodule name	Functionality
square_root_control	This module enables the internal registers and controls the output ready_o.
square_root_step	The combinatorial submodule square_root_step executes 1 iteration of the algorithm. The module is instantiated several times, when more than 1 iteration has to be executed in 1 clock cycle.

The generic g_latency works in this way:

If the generic g_latency is 0, then the module works combinatorial and all iterations are done in 1 clock cycle.
If the generic g_latency is 1, then the module uses a register for its output value and all iterations are also done in 1 clock cycle.
If the generic g_latency is bigger than 1, then the module again uses a register for its output value, but the iterations are not
all done in one clock cycle but splitted to several clock cycles. How many iterations have to be done in 1 clock cycle is calculated by the
module itself and based on the result a sufficient number of submodules square_root_step is instantiated.

As each iteration calculates 1 bit of the output root_o, as many iterations are needed as root_o has bits.

If g_latency="number of root_o bits"=(g_radicand_width+g_radicand_width mod 2)/2+g_additional_root_bits then
1 iteration has to be done in 1 clock cycle, so the submodule square_root_by_cordic_step is only instantiated once.

If g_latency>(g_radicand_width+g_radicand_width mod 2)/2+g_additional_root_bits then again
1 iteration is done in 1 clock cycle and the submodule square_root_by_cordic_step is only instantiated once.
In this case extra root bits are calculated but not propagated to the output.

If g_latency<(g_radicand_width+g_radicand_width mod 2)/2+g_additional_root_bits then
more than 1 iteration has to be done in 1 clock cycle, so the submodule square_root_by_cordic_step is instantiated several times.
If (g_radicand_width+g_radicand_width mod 2)/2+g_additional_root_bits is not dividable without remainder by g_latency, then
again more than the needed number of root bits are calculated.

The integer part of the calculated root at output root_o is always smaller than or equal to the exact root.
The accuracy can be increased by setting g_additional_root_bits to a value different from 0,
which creates additional bits at root_o beneath the binary point.

The difference between the exact root and the calculated root at output root_o is always less than 2**(-g_additional_root_bits).

In comparison to the module "cordic_square_root" (also available at this website) the module "square_root" gives better accuracy,
especially when g_radicand_width is bigger than 10 bits (when cordic_square_root and square_root are configured in a way that
they generate the same number of root bits).

This module calculates the square root of a positive integer (radicand).
As the calculated root has a limited number of bits, it cannot represent the exact square root of the radicand (except when the radicand itself is a square number).

But the calculated root will always exactly fulfill this equation:

radicand = root**2 + remainder (0)

where remainder is always positive or zero and is made as small as possible.

To find the smallest remainder, the algorithm sets each bit of the root to 1 on a trial basis (starting at the most significand bit, MSB) and
checks the sign of the resulting remainder after each set bit.
When the remainder is still positive, then setting the bit was correct and can be kept.

The algorithm starts with root=0 and remainder=radicand, which is always correct (step i=0, see below).
At the next step the MSB of root is set and the sign of the new remainder is determined by calculating the difference radicand-root**2.
When the sign of the difference is positive, then setting the bit to 1 was correct, otherwise the bit must be reset to 0.
Afterwards this procedure is done with each next bit of root, until the last bit of root is determined.

Calculating the difference radicand-root**2 is not so easy, as the multiplication root*root is involved.
But this multiplication can be avoided:
The radicand has 2*n bits (if g_radicand_width is an odd number, an additional MSB with value 0 is added to the radicand),
where n is the number of bits of the root.
When in step i (i=1..n) of the algorithm a next bit of the root is determined, then
the previous result root[i-1] and remainder[i-1] already fulfill equation 0:

radicand = root[i-1]**2 + remainder[i-1] (1)

After step i the new determined root(i) must also fulfill equation 0:

radicand = root[i]**2 + remainder[i] (2)

As the sign of remainder[i] must be checked, the equation 2 is resolved to remainder[i]:

remainder[i] = radicand - root[i]**2 (3)

The term root[i] is equal to root[i-1] with the bit in evaluation set to 1 (at step i=1 the MSB is set) which is equivalent to this addition:

root[i] = root[i-1] + 2**(n-i)

Now in equation 3 the term root[i] can be replaced:

remainder[i] = radicand - (root[i-1] + 2**(n-i))**2 (4)
remainder[i] = radicand - root[i-1]**2 - 2*root[i-1]*2**(n-i) - 2**(2*n-2*i) (5)

The difference "radicand - root[i-1]**2" is known from equation 1 and has the value remainder[i-1]:

remainder[i] = remainder[i-1] - 2*root[i-1]*2**(n-i) - 2**(2*n-2*i)
remainder[i] = remainder[i-1] - ( root[i-1]*2**(n-i+1) + 2**(2*n-2*i)) (6)

Adding 2**(2*n-2*i) to the product root[i-1]*2**(n-i+1) means that bit (2*n-2*i) of the product must be set to 1.
The product root[i-1]*2**(n-i+1) means that root[i-1] must be shifted left in each step by a decreasing number (n,n-1,...,1) of bits.

Instead of implementing these different left shifts, root[0] is implemented as a signal with 2n bits, which is already shifted left by n bits and has all bits at value 0.
If in step i=1 the MSB of root has to be set, then root[1] is calculated by setting the MSB of root[0] (which has 2n bits) to 1.
Otherwise root[1] is identical to root[0].
Then root[1] is shifted 1 bit to the right for the next step, this is the same as shifting left by (n-2+1) bits in the next step.

In the next step when the next bit of root has to be set, the shifted position of root has to be taken into account.
The index of the next bit of root to be set is determined by 2*n-1-2*(i-1).
Again the new calculated root is shift right by 1 bit at last.

After the last step root was shifted right by 1 bit so often, that it is shifted to the correct position.

At each step the new calculated remainder can only be used, if it is positive, otherwise the remainder is kept unchanged.

Source code for HDL-SCHEM-Editor and HDL-FSM-Editor for module "square_root" and its testbenches (Number of downloads = 230 ).
With these files the schematics and the state-diagram can be loaded into HDL-SCHEM-Editor or HDL-FSM-Editor and can be easily read and modified:

hdl_editor_designs.zip

All module VHDL-files of the module "square_root" (Number of downloads = 226 ).
These files were generated by HDL-SCHEM-Editor and HDL-FSM-Editor:

VHDL_designs.zip

All testbench VHDL-files of the 2 testbenches of the module "square_root" (Number of downloads = 239 ).
These files were generated by HDL-SCHEM-Editor and HDL-FSM-Editor:

VHDL_testbenches.zip

Relocation hints:

You should extract all archives into a folder named "square_root".

Then you should load the toplevel (probably the testbench) into HDL-SCHEM-Editor.
When you navigate through the design hierarchy by a double click at each symbol,
HDL-SCHEM-Editor will find the submodules on your disk and ask if it can replace
the original path to the submodule by the new one at your disk.
After storing the changed modules the relocation of the source files is ready
(instead you could replace "M:/gesicherte Daten/Programmieren/VHDL/square_root" in all
"hdl_editor_designs/*.hse" source files by your path to this directory with your editor).

Now you can navigate through the design by HDL-SCHEM-Editor and generate HDL by HDL-SCHEM-Editor for
all modules except square_root_control, for which the HDL must be generated by HDL-FSM-Editor.
Of course there is only need for generating HDL, if you change something at the modules, because you can find the HDL in VHDL_designs.zip and VHDL_testbenches.zip.

If you want to simulate or modify the modules by HDL-SCHEM-Editor you also must adapt the information in the Control-tab of the toplevel you want to work on.
There you must define a "Compile through hierarchy command", an "Edit command", the path to your HDL-FSM-Editor and a "Working directory".

Change log:

Version 1.0 (13.01.2025):

Initial version

If you detect any bugs or have any questions,
please send a mail to "matthias.schweikart@gmx.de".

The module "square_root"a VHDL implementation of a root calculation

Ports:

Generics:

Relocation hints:

Change log:

The module "square_root"
a VHDL implementation of a root calculation