2D-DCT Implementation for JPEG encoder

I would like some suggestions for a simple architecture of a 2D-DCT for a JPEG encoder.
@cfelton @josyb

@mkatsimpris do you mean in general, or are there specifics that you have questions about? Is the implementation in the v1 reference more complicated than it needs to be (is that the thought)? A first step would be to review the v1 implementation and documentation and understand its architecture. Additionally, the v2 2D-DCT can be surveyed.

Have you identified how blocks (image blocks) are going to move in and out of the 2D-DCT? Is the interface the same as, or different from, the v1 reference? Does the interface meet the requirements of streaming encoding? How the data is fed into the 2D-DCT can have an impact on design decisions.

As regards the implementation of the 2D-DCT in the v1 reference, the overall implementation is described in the documentation, but the mdct module itself is quite complicated and has no comments. There are a lot of control signals and RAMs. In the v2 design, however, the implementation is more straightforward and simpler to follow. The v2 reference also uses a minimal flow design without complicated control signals and FSMs.

The ref implementation is rather frightening!

Coming at it fresh and with a myhdl hat on, here is how I would implement it; I’d be interested to know if there are any problems with the strategy:

Create a serial-in/serial-out 1D DCT block based around a streaming protocol like AXI stream. So, for an N-length DCT, this would work by passing each input value to N multiply-accumulate (MACC) units along with Python pre-computed coefficients (c(j,i), with j denoting the matrix row and i the column, i incrementing on each cycle), with each MACC unit computing one value of the output after N steps (equivalently, each MACC unit corresponds to one row of the DCT matrix). The N-length result is then serialised and pushed out, as in the sketch below.
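
To make that concrete, here is a rough behavioural model of the idea in plain Python (not MyHDL yet; the names are purely illustrative):

```python
from math import cos, pi, sqrt

N = 8

def coeff(j, i, n=N):
    """Orthonormal DCT-II coefficient c(j, i)."""
    scale = sqrt(1.0 / n) if j == 0 else sqrt(2.0 / n)
    return scale * cos((2 * i + 1) * j * pi / (2 * n))

# Pre-computed in Python, as described above.
C = [[coeff(j, i) for i in range(N)] for j in range(N)]

def dct_1d_serial(samples):
    """Feed one sample per 'cycle'; each of the N accumulators models
    one MACC unit (one row of the DCT matrix). After N cycles the
    accumulators hold the N outputs, ready to be serialised out."""
    acc = [0.0] * N
    for i, x in enumerate(samples):   # i increments each cycle
        for j in range(N):            # the N parallel MACC units
            acc[j] += C[j][i] * x
    return acc

# Quick sanity check against the direct matrix formulation.
data = [52, 55, 61, 66, 70, 61, 64, 73]
direct = [sum(C[j][i] * data[i] for i in range(N)) for j in range(N)]
assert all(abs(a - b) < 1e-9 for a, b in zip(dct_1d_serial(data), direct))
```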

At this point, you can put it back in memory in its transposed form, so it can be read out serially again for the second-stage 1D DCT (to get the 2D DCT).

Using a streaming protocol means that nothing really needs to care about the latency of your processing pipeline - you can pull out and replace the actual internals of the 1D DCT and everything will work fine even if the latency (or indeed the throughput) changes. This is an issue when it comes to differing DSP architectures across vendors. The ideal would be something like vendor-specific MACC units to optimise for each architecture.

By taking the transpose at the end of the pipeline, it will be properly aligned in memory for the next step, and also for the final output. This can be done really easily with myhdl :smile:.

I would suggest two memory buffers. One stores the initial data, the second stores the transposed 1D result, and the final result is then put back into the first. A consequence of this is that you would need to switch the memory location based on which stage of the 2D DCT you’re performing - see the sketch below. Again, something like AXI stream allows the data to be packetised as it passes through, so the position of your data can be tracked and the state machine should be pretty easy to implement.
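
As a sketch of the addressing (plain Python again; the buffer and function names are hypothetical):

```python
N = 8

buf_a = [0] * (N * N)   # holds the incoming block, later the 2D result
buf_b = [0] * (N * N)   # holds the transposed first-pass results

def write_transposed(buf, row, results):
    """Store the N outputs of one 1D DCT pass transposed: output j of
    input row `row` lands at position (j, row)."""
    for j, value in enumerate(results):
        buf[j * N + row] = value

def read_row(buf, row):
    """Read one row out serially, as the 1D DCT core expects."""
    return [buf[row * N + i] for i in range(N)]

# Pass 1: rows are read from buf_a, transposed results written to buf_b.
# Pass 2: rows are read from buf_b, transposed results written back to
# buf_a, which then holds the finished 2D DCT block in natural order.
```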

If it’s useful, I have a partial (though enough for this case) AXI stream BFM implemented (both master and slave).

Hello Henry,

My current design idea for the 2D-DCT is similar to yours. The design that you proposed is simple and could be implemented straightforwardly in myhdl.

As regards vendor primitives like DSP units: if the produced VHDL or Verilog satisfies the guidelines for DSP optimization, then the multipliers will be implemented in DSP units. Alternatively, user-designed primitives could be used which instantiate each vendor’s specific units.

As a start I will implement the 1D-DCT, which takes serial input/output, and write the tests.

Another issue is the internal fixed-point representation, the fixed-point representation of the outputs, and rounding and truncation.

@mkatsimpris I fully acknowledge that following the design guidelines for DSP optimisations is the way to go; however, the issue is as much about maximising performance.

e.g. if one just does a*b for a Xilinx target, it will use a DSP unit, but it won’t be able to run at maximum speed. To do that one needs to allow for a 3-cycle pipelined DSP. AFAICT, this is different on Altera, so clearly there is not one vendor-neutral way to design a fixed-latency DSP pipeline that is optimally performant in all cases.
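
For illustration, a fixed-latency multiply in MyHDL might look something like this (a hedged sketch - the 3-stage split is the sort of thing Xilinx tools want before running a DSP at full speed, and the signal widths are illustrative only):

```python
from myhdl import block, always_seq, Signal, intbv

@block
def pipelined_mult(clock, reset, a, b, p):
    """p <= a * b with a fixed latency of 3 clock cycles."""
    a_r = Signal(intbv(0, min=a.min, max=a.max))
    b_r = Signal(intbv(0, min=b.min, max=b.max))
    m_r = Signal(intbv(0, min=p.min, max=p.max))

    @always_seq(clock.posedge, reset=reset)
    def pipeline():
        a_r.next = a          # stage 1: register the operands
        b_r.next = b
        m_r.next = a_r * b_r  # stage 2: register the raw product
        p.next = m_r          # stage 3: register the output

    return pipeline
```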

Of course, the fixed point representation is a concern - again, the spec could define what is required here and the internals would need to cope with it. What is the internal bit width for JPEG? I’d be surprised if most DSPs can’t easily cope with it.

For the coefficients, most designs use a 14-bit fraction part, and the outputs are 11 bits. Also, the inputs are centered to 0 with a subtraction of 127.
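
To make that concrete, a quick plain-Python sketch of such a scheme (the numbers follow the description above; the names are illustrative):

```python
from math import cos, pi, sqrt

FRAC_BITS = 14
SCALE = 1 << FRAC_BITS

def quantised_coeff(j, i, n=8):
    """DCT coefficient rounded to a 14-bit fraction,
    e.g. c(0, i) = 0.35355... -> 5793."""
    scale = sqrt(1.0 / n) if j == 0 else sqrt(2.0 / n)
    return int(round(scale * cos((2 * i + 1) * j * pi / (2 * n)) * SCALE))

def level_shift(pixel):
    return pixel - 127   # centre the 8-bit input around zero

# One fixed-point MACC step; the accumulator keeps the 14 fraction
# bits until a final rounding shift back down to the output width.
acc = quantised_coeff(1, 0) * level_shift(200)
rounded = (acc + (SCALE >> 1)) >> FRAC_BITS
```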

I assume the initial input is limited to the range [0, 1], correct? In which case that should easily be handled by both Xilinx and Altera hardware, and I expect all the rest too.

Definitely DSP units can handle the fixed-point representations. Xilinx can support 25x18 bit widths in the Virtex families.

Yeah, exactly. I’m slightly confused, are you suggesting you don’t need the streaming protocol?

Henry,

Merkourios and Vikram are working from the reference implementation on OpenCores.org, which has an elaborate central controller dishing out enables.
Glad you brought that up, I prefer a proper pipelined system too. My experience lies with Avalon-ST, which works just like AXI4 Streaming, and I have, of course, my pipeline blocks in place.

Regards,

Josy

@josyb Yeah, Avalon stream and AXI4 stream are pretty much identical in the simple case AFAICT. For single channel with no errors one can pretty much connect them together (although the startofpacket has no analogue in AXI, which seems redundant to me anyway).

Don’t prematurely optimize; let’s not worry about utilizing specific primitives from specific vendors at this point unless you identify a compelling reason. Also, when figuring out the architecture don’t get hung up on which type of streaming interface to use. But we do need to know the general interfaces: the 2D-DCT needs an 8x8 block (the block size should be generic)? We need to know how the flow of data is handled in and out, and what type and size of buffers and/or memory are required.

My approach would be to outline the architecture of the v1 reference, determine how to remove the “global controller” (local flow control if possible), and then determine if it can be simplified.

@cfelton I concur absolutely. That said, there is clearly an architectural decision as to whether to use a streaming model or something else.

I think the simplest and most straightforward approach for the design of the 2D-DCT is the row-column decomposition, as mentioned by @hgomersall.

As regards the data flow in/out of the 2D-DCT, I think there are two approaches: the first is serial data input with parallel output, and the second is parallel input with parallel output. Each decision has its own trade-offs.

If we decide to input the data serially and output them in parallel, then the utilization for each 1D-DCT is 8 multipliers and 8 adders (one MACC per output coefficient), but the latency is quite high.

On the other hand, if the data are fed in and output in parallel, then the utilization for the 1D-DCT is 64 multipliers and 56 adders (8 multipliers per output coefficient, and 7 adders to sum each set of 8 products), but the latency is quite low.

Moreover, additional design decisions have to be made for the storage of the outputs with respect to the flow decision made at each stage.

If I am wrong somewhere, I would welcome suggestions and corrections of my thoughts.

@hgomersall the “streaming” is more of a requirement than an architectural decision. The jpegenc system and the reference designs all use processing blocks where the data is streamed in and out (i.e. point-to-point data flow from block to block).

These are not coprocessors on a shared bus (e.g. a processor system) that need to arbitrate the bus, retrieve data from a certain location, and put it somewhere else.

Reference v1 has a peculiar global controller that controls the flow of data; it is a design decision to replace the global flow control with local control (e.g. a simple ready-valid handshake like AXI4). This will allow the subblocks to be more modular and reusable (e.g. used in other systems).
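
As a rough illustration of such local flow control (an assumed sketch, not a definition of the jpegenc interfaces), a minimal ready-valid register stage in MyHDL could look like:

```python
from myhdl import block, always_seq, always_comb

@block
def stream_register(clock, reset, in_data, in_valid, in_ready,
                    out_data, out_valid, out_ready):
    """Accept a word when in_valid & in_ready; hold it until the
    downstream block takes it with out_valid & out_ready."""

    @always_seq(clock.posedge, reset=reset)
    def transfer():
        if in_valid and in_ready:
            out_data.next = in_data
            out_valid.next = True
        elif out_ready:
            out_valid.next = False

    @always_comb
    def backpressure():
        # Ready whenever the output register is empty or being drained.
        in_ready.next = (not out_valid) or out_ready

    return transfer, backpressure
```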

@cfelton It seems we were discussing semantics. What you say I agree with.

@mkatsimpris It is possible to sequentially process each result with a parallel load, but there needs to be more control logic to handle it. The benefit IMO of a serial load is that the control logic is very simple - when a packet is completed, that’s the end of the processing (no need to monitor it). With a parallel read you either use loads of resources or you need to monitor the state of the processing with some control logic. Blocks that use a small quantity of resources will scale better into multiple parallel streams.

I see latency as largely irrelevant as long as everything is pipelined properly - it should be negligible in comparison to the number of cycles of processing.

@mkatsimpris, I agree with @hgomersall, latency should be less of an issue. This hasn’t been discussed but the encoder should be able to stream (real-time encode) a high quality video stream, 1920x1080 or higher.