GSoC #5: Sum Problem of Fixed-Point Type


The original blog post is here.

The Problem

If you want to sum up several fixbv fixed-point variables, what will you do?

If your answer is as follows:

a + b + c + d

It seems to be correct but there would be a trap on the bit width.

Bit Width Changing While Adding

Here we use (total_width, integer_width, fractional_width) to indicate the bit width of the fixbv variable, its integer part, and its fractional part.

As written in MEP-111, if two fixed-point variables are added, the fractional part of the result should be the longest of the operands; and the integer part of the result should be the longest plus one, in order to avoid overflow.

This is an example of adding a variable in format (8, 3, 4) and another variable in format (8, 0, 7):

+ ssss.fffffff

We can see that the first operand has been added zeros in the tail of the fractional part, and the second operand has been added sign bits in the beginning, in order to perform point alignment.

A Critical Example

Considering the following situation:

>>> a.format
(16, 4, 11)
>>> b.format
(16, 4, 11)
>>> c.format
(8, 3, 4)
>>> d.format
(8, 7, 0)

If we add like a + b + c + d, the format of the final result will be:

 (16, 4, [11])+(16, 4, [11])+(8, 3, [4])+(8, 7, [0])
=(17, 5, [11])+(8, 3, [4])+(8, 7, [0])
=(18, 6, [11])+(8, 7, [0])
=(20, 8, [11])

However, if we add in d + a + b + c, the result should be:

 (8, 7, [0])+(16, 4, [11])+(16, 4, [11])+(8, 3, [4])
=(20, 8, [11])+(16, 4, [11])+(8, 3, [4])
=(21, 9, [11])+(8, 3, [4])
=(22, 10, [11])


If we use the assignment like:

x = a + b + c + d

this problem should be consider. But, if we have already defined the format of x and perform the operation as

x[:] = a + b + c + d

it would be safer because the format of x has already been decided. However, rounding and overflow must be considered if necessary.

A New Sum Function for Fixed-Point Variables

So here, in order to make the result unique in different orders, a new function fxsum is needed in this theme. The usage is similar to built-in sum:

fxsum(iterable, start=0)

fxsum requires an iterable parameter containing fixbv instances, and returns the sum as a fixbv variable. The width of the fractional part would be the longest in the iterable, while the width of the width of the integer part is the longest plus ceil(log(N, 2)), in which N is the width of the longest integer part.

Instead of using a + b + c + d, we could write

fxsum((a, b, c, d))

to avoid ambiguity of bit width in the result.

As a result, in the previous example, we could obtain that

>>> fxsum((a, b, c, d)).format
(21, 9, [11])