The original blog post is here.
If you want to sum up several
fixbv fixed-point variables, how would you do it?
Suppose your answer is:
a + b + c + d
It seems correct, but there is a trap in the bit width of the result.
Bit Width Changing While Adding
Here we use
(total_width, integer_width, fractional_width) to denote the bit widths of a
fixbv variable, its integer part, and its fractional part.
As written in MEP-111, when two fixed-point variables are added, the fractional width of the result is the larger of the two operands' fractional widths, and the integer width of the result is the larger of the two operands' integer widths plus one, in order to avoid overflow.
This is an example of adding a variable in format
(8, 3, 4) and another variable in format
(8, 0, 7):
We can see that the first operand has been padded with zeros at the tail of its fractional part, and the second operand has been extended with sign bits at the front, in order to align the binary points.
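The width rule above can be sketched in plain Python. This is only an illustration that works on format tuples, not MyHDL code: the helper name add_format is hypothetical, and it assumes that the total width is one sign bit plus the integer and fractional widths, which matches the formats used in this post.

```python
def add_format(x, y):
    """Return the (total, integer, fractional) format of x + y.

    x and y are (total_width, integer_width, fractional_width) tuples.
    """
    _, x_int, x_frac = x
    _, y_int, y_frac = y
    int_width = max(x_int, y_int) + 1    # one extra bit to avoid overflow
    frac_width = max(x_frac, y_frac)     # keep the finer resolution
    total = 1 + int_width + frac_width   # assumption: exactly one sign bit
    return (total, int_width, frac_width)


print(add_format((8, 3, 4), (8, 0, 7)))  # (12, 4, 7)
```

For the two operands in the example, the sketch gives a 12-bit result with 4 integer bits and 7 fractional bits.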
A Critical Example
Consider the following situation, with four variables in these formats:
a: (16, 4, 11)
b: (16, 4, 11)
c: (8, 3, 4)
d: (8, 7, 0)
If we add them in the order
a + b + c + d, the format of the final result will be:
(16, 4, 11)+(16, 4, 11)+(8, 3, 4)+(8, 7, 0)
=(17, 5, 11)+(8, 3, 4)+(8, 7, 0)
=(18, 6, 11)+(8, 7, 0)
=(20, 8, 11)
However, if we add them in the order
d + a + b + c, the result will be:
(8, 7, 0)+(16, 4, 11)+(16, 4, 11)+(8, 3, 4)
=(20, 8, 11)+(16, 4, 11)+(8, 3, 4)
=(21, 9, 11)+(8, 3, 4)
=(22, 10, 11)
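The two derivations can be reproduced with a short sketch that folds the pairwise rule over the operands from left to right, the way a chained + is evaluated. The helper add_format is a hypothetical name, it operates on format tuples rather than real fixbv objects, and it assumes one sign bit in the total width.

```python
from functools import reduce


def add_format(x, y):
    """Format of x + y for (total, integer, fractional) format tuples."""
    int_width = max(x[1], y[1]) + 1      # longest integer part plus one
    frac_width = max(x[2], y[2])         # longest fractional part
    return (1 + int_width + frac_width, int_width, frac_width)


a = (16, 4, 11)
b = (16, 4, 11)
c = (8, 3, 4)
d = (8, 7, 0)

# A chained sum is evaluated left to right, so fold the rule in that order.
print(reduce(add_format, [a, b, c, d]))  # (20, 8, 11)
print(reduce(add_format, [d, a, b, c]))  # (22, 10, 11)
```

The same four operands produce results that differ by two integer bits, purely because of the order in which they are added.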
If we use a plain assignment such as:
x = a + b + c + d
this problem has to be considered, because the format of x depends on the order of the operands. But if we have already defined the format of x and perform the operation as
x[:] = a + b + c + d
it is safer, because the format of
x has already been decided. Rounding and overflow must still be considered where necessary.
A New Sum Function for Fixed-Point Variables
So, in order to make the result format independent of the order of the operands, a new function
fxsum is needed. Its usage is similar to the built-in sum:
fxsum requires an iterable parameter containing
fixbv instances, and returns the sum as a
fixbv variable. The width of the fractional part is the longest in the iterable, while the width of the integer part is the longest in the iterable plus
ceil(log(N, 2)), in which
N is the number of elements in the iterable.
Instead of using
a + b + c + d, we could write
fxsum((a, b, c, d))
to avoid ambiguity of bit width in the result.
As a result, in the previous example, we obtain:
>>> fxsum((a, b, c, d)).format
(21, 9, 11)
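The format rule described for fxsum can be sketched on format tuples as follows. This is not the actual fxsum implementation: the name fxsum_format is hypothetical, one sign bit is assumed in the total width, and N is taken to be the number of items in the iterable, which is the reading consistent with the integer width of 9 shown above. The sketch uses (n - 1).bit_length() as an integer-only equivalent of ceil(log(N, 2)).

```python
def fxsum_format(formats):
    """Result format for summing values with the given formats.

    formats is an iterable of (total, integer, fractional) tuples.
    """
    formats = list(formats)
    n = len(formats)
    if n == 0:
        raise ValueError("fxsum_format needs at least one format")
    growth = (n - 1).bit_length()        # equals ceil(log2(n)) for n >= 1
    int_width = max(f[1] for f in formats) + growth
    frac_width = max(f[2] for f in formats)
    return (1 + int_width + frac_width, int_width, frac_width)


a = (16, 4, 11)
b = (16, 4, 11)
c = (8, 3, 4)
d = (8, 7, 0)

print(fxsum_format((a, b, c, d)))  # (21, 9, 11)
print(fxsum_format((d, a, b, c)))  # (21, 9, 11)
```

Because the rule only looks at the maxima and the number of operands, the result no longer depends on the order of the operands.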