JPEG Encoder Data Flow

As the project grows, there are some concerns about the data flow in the encoder. I will try to describe my part of the encoder (the frontend) and how the data is handled in the color space conversion and 2D-DCT modules.

Firstly, the pixels of the image are fed in serially, one per cycle, row by row within each 8x8 block, into the color space conversion module. The color space conversion module outputs the Y, Cb, Cr signals serially into the 2d-dct module. There are three separate 2d-dct modules, one for each signal (Y, Cb, Cr), which work in parallel. These modules take their inputs serially from the previous module and, after some cycles, output the 8x8 block in parallel. Thus, after the conversion is done there are three 8x8 blocks ready to be processed by the back-end part of the encoder.
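
For reference, a behavioral sketch of what the frontend computes for one 8x8 block, in plain NumPy/SciPy (not the MyHDL implementation; the function names are mine, the conversion coefficients are the standard JFIF ones):

```python
import numpy as np
from scipy.fftpack import dct

def rgb_to_ycbcr(r, g, b):
    """Standard JFIF RGB -> YCbCr conversion, per 8x8 block of each channel."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.1687 * r - 0.3313 * g + 0.5 * b + 128
    cr =  0.5 * r - 0.4187 * g - 0.0813 * b + 128
    return y, cb, cr

def dct_2d(block):
    """2D-DCT of one 8x8 block (level shifted by -128 as in JPEG)."""
    shifted = block - 128.0
    return dct(dct(shifted, axis=0, norm='ortho'), axis=1, norm='ortho')

# One 8x8 block goes through color conversion and then three
# independent DCTs (one per component), mirroring the three
# parallel 2d-dct modules described above.
r = g = b = np.full((8, 8), 200.0)
y, cb, cr = rgb_to_ycbcr(r, g, b)
dct_blocks = [dct_2d(component) for component in (y, cb, cr)]
```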

There is no need yet for complex FSMs or data storage, as the data flow can be continuous through both modules (Color Space Conversion, 2D-DCT) without any conflict.

Moreover, the zig-zag module could be integrated into the 2d-dct module; only the order of the 2d-dct outputs needs to be changed to produce the zig-zag form.
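
For illustration, the reorder is just a fixed permutation of the 64 output positions; a small Python helper (hypothetical, only to show the ordering) can generate it:

```python
def zigzag_indices(n=8):
    """Return the (row, col) visiting order of an n x n zig-zag scan."""
    def key(rc):
        r, c = rc
        d = r + c                        # anti-diagonal number
        return (d, r if d % 2 else c)    # odd diagonals go down, even go up
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)

# First entries for 8x8: (0,0), (0,1), (1,0), (2,0), (1,1), (0,2), ...
print(zigzag_indices()[:6])
```
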
@cfelton @josyb @Vikram9866 @vikram @hgomersall

Uhm, what’s the concern? Is there something to discuss?

The topic started in order to get some comments about the data flow which I described. Moreover, if there are any problems in the collaboration between the front-end and back-end part of the encoder, we could discuss them.

@mkatsimpris thanks for starting this conversation, one to communicate with those involved and two to document what currently exists.

In most applications (that I foresee) the jpegenc will be compressing a video stream in real time. On the frontend there will need to be a line/block buffer that buffers N lines (N is nominally 8; the block size should be a top-level parameter). From the line buffer the blocks are streamed into the color conversion and then to the DCT.
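
As a behavioral sketch of that line/block buffer (plain Python with made-up names, assuming N equals the block size of 8):

```python
def blocks_from_rows(pixels, width, block_size=8):
    """Buffer block_size image rows from a row-serial pixel stream and
    yield block_size x block_size blocks, left to right."""
    row_buf, current = [], []
    for px in pixels:
        current.append(px)
        if len(current) == width:          # one full image row collected
            row_buf.append(current)
            current = []
        if len(row_buf) == block_size:     # enough rows for one block strip
            for x0 in range(0, width, block_size):
                yield [row[x0:x0 + block_size] for row in row_buf]
            row_buf = []

# e.g. for a 640-pixel-wide frame, each 8-row strip produces 80 blocks
```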

@mkatsimpris as you indicate, three blocks are available out of the DCT, how are these currently made available? Streamed out of a FIFO or addressed out of memory? How many clock cycles does it take for the block conversion?

@mkatsimpris are you suggesting that it is optimal to embed the zigzag in the DCT instead of a separate subblock? If so, can it be a top-level parameter to have zig-zag outputs or normal 2D-DCT outputs?

There are three 2D-DCT modules which work in parallel. They take each component (Y, Cb, Cr) separately. For the first block it takes 89 cycles to get converted in the 2d-dct module, and for the following blocks it takes 64 cycles. The blocks are available after all the inputs (plus some extra cycles of latency) have been fed into the 2d-dct module. So they are not streamed out of a FIFO or addressed out of memory.

As regards the zigzag module, integrating it into the 2d-dct isn’t a matter of optimality; I think it’s more about writing less code. We can use a multiplexer in the 2d-dct top-level module to specify the order of the outputs, zig-zag or not.

Am I missing something? What is the output interface of the 2D-DCT? How do the downstream subblocks get the data? It is not streamed and it is not pulled? What other option is there?

The outputs of the 2d-dct module are 64 signals, which are the values of the 8x8 matrix. Thus, for the three 2d-dct modules there are 3*64 signals (three 8x8 blocks). These signals are valid for 64 cycles, so the other blocks in the backend part can process them within those 64 cycles.

@mkatsimpris, @vikram

There is a disconnect between the output of the frontend and the input of the backend. Ideally this would have been resolved before we started implementing the blocks, it was one of the first action items that was identified at the beginning of the program.

Currently, as stated in this thread, the output of the frontend is a single 3x8x8 (a block being 8x8) parallel bus. All the outputs are ready in parallel. The first subblock in the backend, the quantizer, is expecting a stream of pixels.

The parallel implementation of the DCT is the highest performance but it requires quite a bit of resources, making it impractical for medium size FPGAs.

In the near term, to complete a functional encoder, we need to create a module in the DCT that will multiplex out the 3*block_size values, and some flow control needs to be added to hold the DCT inputs from accepting new data until all the outputs are muxed out.

Essentially, a PixelStream is the interface between all the subblocks in the jpegenc (with the one exception being the dct->zigzag).

The only loose requirement that was given for the jpegenc was to encode HD video in real time. This should be reviewed to determine if there are cycles that can be used to pipeline the DCT (more time, fewer resources).

@cfelton I hadn’t taken into consideration that we will be using mid-sized FPGAs. This requirement wasn’t there at the start of the program. The only requirements that we had were the minimum flow and HD streaming.

I will try to change the architecture of the dct in order to use fewer resources and output the values as a stream. However, I don’t know if the time remaining will be enough to complete all the milestones which I described in the proposal.

@mkatsimpris Yes, this (medium size FPGA) was not explicitly defined at the beginning of the project; it was implicit via the reference designs.

The plan should be to complete a working jpegenc with the current DCT architecture but adding the DCT output mux and required flow control. The DCT interfaces will be PixelStream in and PixelStream out. Once this is in place it will be straightforward to swap in compatible implementations. With the test suite and the full encoder verification it will be easy (and fun!) to swap out different implementations.

If we run out of time and can’t include a different DCT implementation, that is fine. In this week’s blog could you outline the remaining tasks you want to complete (those you mentioned above from the proposal)?

High-level remaining tasks:

  1. Working encoder with existing blocks, requires a DCT output mux and appropriate flow control.
  2. Verification of the working encoder.
  3. Optimized subblocks.

The flow being discussed should have been one of the first things that was specified and designed in the project. For this project the students were learning a bunch of new things, so I unintentionally let it slide. I should have pushed more to have the architectural definition … bygones are bygones.

As I understand it, architecturally, we currently have:

#### 1. Input video stream (pixel stream)
Pixel inputs at some rate:

a. 640 x 480 @ 60 Hz 18.432 Mpx/s
b. 720 x 480 @ 60 Hz 20.736 Mpx/s
c. 1280 x 720 @ 60 Hz 55.296 Mpx/s
d. 1920 x 1080 @ 60 Hz 124.416 Mpx/s
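
These rates are simply width x height x frame rate; a quick check:

```python
modes = [(640, 480, 60), (720, 480, 60), (1280, 720, 60), (1920, 1080, 60)]
for w, h, fps in modes:
    print(f"{w} x {h} @ {fps} Hz -> {w * h * fps / 1e6:.3f} Mpx/s")
# 18.432, 20.736, 55.296, 124.416 Mpx/s
```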

#### 2. Row buffer
Buffers N rows, where N is the number of rows in the block size
(8x8). Outputs image blocks at the same rate as the input.

#### 3. Color conversion
Pixel-by-pixel; maintains the input rate (no processing overhead).

#### 4. DCT
The DCT works on blocks; a block is streamed in at the input rate. A block will take nblk_cycles (64 for 8x8) + nproc_cycles. Also, the DCT needs to be run 3 times, once for each color component. The total processing overhead is:

p = nproc_cycles + 2 * (nblk_cycles + nproc_cycles)

(the above isn’t 100% correct, I believe the nproc_cycles is only
required on the first block? The overhead is the cycles in
addition to the cycles required to clock in the block)

In the DCT case, the DCT will need to be operating at a clock rate
that is p times greater than the input rate to maintain the
throughput.
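
As a rough sketch of that overhead calculation (assuming nblk_cycles = 64 and nproc_cycles ≈ 25, taken from the 89 vs. 64 cycle figures quoted earlier in this thread):

```python
nblk_cycles = 64     # cycles to clock one 8x8 block in
nproc_cycles = 25    # extra pipeline/processing cycles (89 - 64 from above)

# overhead beyond clocking in one block, covering all three color components
p = nproc_cycles + 2 * (nblk_cycles + nproc_cycles)
print(p)  # 203 cycles of overhead per input block
```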

#### 5. Zig-Zag
Works on a block (a parallel vector, 64 items) and generates the output. This subblock has no processing overhead.

The output of the dct-zz will be serialized …

#### 6. Quantizer
Accepts a pixel stream at the input rate (more details are needed: when and why it pauses, if needed).

#### 7. RLE
(details needed for inputs, processing, and outputs)

#### 8. Huffman encoder
(details needed for inputs, processing, and outputs)

#### 9. Byte stuffer
output jpeg-stream, includes header etc.
(details needed for inputs, processing, and outputs)

Ignoring some of the missing details for the backend subblocks,
the current architecture working backwards could support an input
rate of roughly 500 Kpx/sec with a clock of 100 MHz.
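
A back-of-envelope check of that number, following the "clock p times greater than the input rate" rule above and the p ≈ 203 sketched earlier:

```python
clock_hz = 100e6
p = 203                                   # DCT overhead per block, from above

input_rate = clock_hz / p                 # input rate the DCT can sustain
print(f"{input_rate / 1e3:.0f} Kpx/s")    # ~493 Kpx/s, i.e. roughly 500 Kpx/s
```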

@mkatsimpris, @Vikram is this correct?

cc: @josyb, @nikolaos.kavvadias, @hgomersall, @jos.huisken


@cfelton you are right about most of the description. The input to the encoder can be fed in serially instead of buffering N rows. Is there any problem with that? We can have a RAM to store the block and a ready signal. The ready signal will go high when the encoder needs new data and the values will be stored in the RAM.

As regards the 2d-dct, the inputs are fed serially, so it works at the input clock rate. The clock cycles required for the dct are the latency from the pipeline registers plus 64*3 cycles to process the 3 blocks. So the total processing time in the frontend part for an image is the latency from the pipeline registers plus 192*blocks cycles. The current design takes 292 cycles (192+100) to process the first block and then 192 (64*3) more for each subsequent block.

The outputs from the frontend part are serialized from the output signals of the zig-zag module (this is already implemented). So the backend part can take each value serially and process it.

If we want to process HD video we have to reach 60 fps. 60 fps means a processing time of about 0.0167 seconds per frame. A 1920x1080 image means 32,400 blocks to process. Processing each block will take 192 cycles plus the latency. That makes 192*32400 + latency = 6,220,800 + latency. We can assume a latency of 200 cycles, so the processing cycles for the frame are 6,221,000. If we want to achieve 60 fps we must have a clock with freq = 6,221,000 * 60 ≈ 373 MHz.
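
The same estimate in a few lines of Python (the 200-cycle latency is the rough assumption made above):

```python
width, height, fps = 1920, 1080, 60
blocks = (width * height) // 64           # 32,400 8x8 blocks per frame
cycles_per_block = 192                    # 64 cycles x 3 components
latency = 200                             # rough pipeline latency assumed above

cycles_per_frame = blocks * cycles_per_block + latency   # 6,221,000
clock_hz = cycles_per_frame * fps
print(f"{clock_hz / 1e6:.0f} MHz")        # ~373 MHz
```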

@mkatsimpris, the input buffer needs to buffer N rows, otherwise you cannot get an 8x8 block (in this case N=8) to perform the calculations; you need N rows before you can get a block. To avoid the buffering you would need to change the video source to output block serialization rather than row serialization, which no video sources do.

When we refer to a row, is it a row of the image or a row of the block?

Backend Flow:

Quantizer:

Right now I assume that the front-end stores data in a FIFO. I access data from the input FIFO. Once a block (8x8) is written into the input FIFO, the Quantizer top takes the input data serially from the input FIFO, processes it, and stores it in an output FIFO.

The divider used in the Quantizer is pipelined with 4 stages.

(As the front-end is already storing the pixels and streaming them serially, I may need to remove the input FIFO and take the data directly from the front-end.)

Number of clocks for one YCbCr block in the Quantizer: 64*3 + 3 = 195, roughly 200 cycles.
(1 YCbCr block = Y, Cb, Cr components = 3 blocks)

Once a block is stored in the input FIFO of the quantizer, the start signal asserts (data-valid from the front end). Once a block is complete, the quantizer module sends a ready signal to the RLE module.

Run Length Encoder:

The Run Length Encoder takes its input from a block stored in the output FIFO of the Quantizer. The top module takes the input serially from the FIFO, processes it, and stores it in an output FIFO.

It takes 64 clocks for one block, ignoring a latency of 6 cycles.

The same start and ready signal controls as described for the Quantizer apply. The DC registers are reset on completing one set of Y, Cb, Cr blocks (they hold the previous data from the last YCbCr components).

Huffman Module:

The Huffman module takes its input from the FIFO in which the RLE module stores data. It processes the data and stores it in an output FIFO.

It takes around three cycles for each input (it needs to look up the corresponding VLC codes from the Huffman tables and do some processing). The number of codes present in the input FIFO may vary because of the variable length encoding of the previous block.

So I take one of the worst case scenarios (very few zeros exist in the input block). It will take around 30*3 = 90 clocks for a block (30 is a rough number I chose).

ByteStuffer:

It takes the input from the FIFO in which the Huffman module stores data and stores the computed data in an output RAM. The cycle count is (3 cycles per pixel) * (number of inputs in the FIFO), which will be less than 40 as described for one of the worst cases: 20*3 = 60 cycles (20 is just a rough number I took).

Functioning:

All the FIFOs present in the design are double FIFOs.

The connection is such that when the Byte Stuffer processes block x, the Huffman module processes block x+1, the RLE processes block x+2, and the Quantizer processes block x+3.

With the double FIFOs, while one module writes into FIFO number one, the next module reads the data of the previously processed block stored in FIFO number two. In this way all the blocks work in parallel.

So, latency = time taken to process one 8x8 block = 280 cycles (roughly).

Once all the FIFOs have data available (the pipeline is, in a sense, full), parallel processing starts and after that it takes roughly 64 cycles per 64 pixels. (The Huffman module won’t take 90 cycles for every block, so I take 64 as the maximum clock cycles an intermediate block uses.)

So, total cycles = 64*(number of 8x8 blocks) + latency

PS: The calculations are rough and I have made a few estimations close to the desired values.
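
A quick sanity check of the total-cycles formula, assuming "number of 8x8 blocks" means the 32,400 blocks of a 1920x1080 frame (the same count used in the frontend estimate above):

```python
blocks = (1920 * 1080) // 64       # 32,400 8x8 blocks per frame
latency = 280                      # rough per-block pipeline latency from above

total_cycles = 64 * blocks + latency      # ~2.07 M cycles per frame
clock_hz = total_cycles * 60              # for 60 fps
print(f"{clock_hz / 1e6:.1f} MHz")        # ~124.4 MHz, about the input pixel rate
```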

@cfelton. Please give it a review.

@mkatsimpris a row in the image, from the perspective of the input buffer (row buff).

I could use some clarification on this statement, when you say it takes 64 clock cycles for an 8x8 block [1] plus latency. Are you stating that the subblock is fully pipelined? That the subblock can process a new sample (pixel) on every clock cycle and the first processed sample is available 6 clock cycles after the first input is valid?

[1] the block size should be fully parameterizable.

@cfelton. Yes, that’s exactly what I mean. I will re-edit the block so that its size is a parameter.

As we have been discussing, we are designing this around a simple ready-valid flow control. We have a number of subblocks and the flow control between the subblocks is specified as the ready-valid flow.

The flow control is needed because (as discussed above) different subblocks have different processing requirements: need a block before processing, variable processing cycles, etc.

By designing each subblock this way we are decoupling the subblocks; the upside is that each subblock can easily be tested standalone and it will be easier to swap out different implementations (e.g. lossy vs. lossless). The downside is that cross-boundary optimizations probably are not possible. The design will be easier to experiment with and change, but it might not be the most efficient.
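
As a minimal sketch of what a ready-valid handshake between two subblocks could look like in MyHDL (the bundle and signal names here are assumptions for illustration, not the project's actual PixelStream definition):

```python
from myhdl import block, always_comb, always_seq, Signal, intbv

class PixelStream:
    """Hypothetical ready-valid pixel stream bundle."""
    def __init__(self, width=8):
        self.data = Signal(intbv(0)[width:])
        self.valid = Signal(bool(0))
        self.ready = Signal(bool(0))

@block
def stream_register(clock, reset, upstream, downstream):
    """One register stage: a sample transfers when valid and ready are
    both asserted; backpressure propagates to the upstream subblock."""

    @always_comb
    def backpressure():
        # accept a new sample when the output register is empty or draining
        upstream.ready.next = (not downstream.valid) or downstream.ready

    @always_seq(clock.posedge, reset=reset)
    def capture():
        if upstream.valid and upstream.ready:
            downstream.data.next = upstream.data
            downstream.valid.next = True
        elif downstream.ready:
            downstream.valid.next = False

    return backpressure, capture
```

Each subblock exposing this kind of interface is what makes it possible to test the subblocks standalone and swap implementations later.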

@vikram pointed out a couple of resources that he has been using for the ready-valid flow control.

  1. EECS150: Interfaces: FIFO
  2. Link-Level Flow Control and Buffering