3.1 Parallel Architecture
                  
                     HEVC is more suitable to parallel processing by further reducing data dependency compared
                     with H.264/AVC. The proposed architecture improves the performance by employing parallelism,
                     a multi-stage pipeline with parallel filters. For parallel filter processing, a special
                     memory allocation algorithm is invented for efficient on-chip SRAM access.
                     
                  
                  
                     Fig. 3 shows the overall block diagram of the proposed architecture. The proposed architecture
                     consists of the following components: BS calculator, parallel dE calculator, β and
                     tc calculator, parallel filter and SRAM. SRAM is divided into 16 blocks. In SRAM,
                     the current pixels of a 32x32 block and their left and above neighboring pixels are
                     stored. Each SRAM block stores 8 4x4 pixel blocks. The data is read from the SRAM
                     blocks for filtering computation and stored back to the SRAM blocks after filter computation.
                     The read/write SRAM block positions are determined by the proposed memory allocation
                     algorithm.
                     
                  
                  
                     
                     
                           
                           
Fig. 3. Block Diagram of the Proposed Architecture
                         
                     
                  
                  
                     The proposed architecture works as a 4-stage pipeline as shown in Fig. 3. The circled number in Fig. 4 denotes the pipeline stage. In stage 1, the data is read from SRAM and the BS calculation
                     is performed. In stage 2, β and tc calculation, and 8 parallel dE calculations are
                     performed. In stage 3, 4x4 block filtering is performed in parallel. In stage 4, filtered
                     results are stored back into SRAM blocks.
                     
                  
                
               
                     3.2 Memory Allocation Algorithm
                  
                     The data access in the on-chip SRAM without conflicts between the parallel filters
                     is very important to maintain high performance. For the proposed parallel filter architecture,
                     we show an efficient memory allocation algorithm to avoid such conflicts. By using
                     this algorithm, each filter is guaranteed to read and write from 2 out of the 16 SRAM
                     blocks without conflicts between them in parallel.
                     
                  
                  
                     
                     
                           
                           
Fig. 4. 32ⅹ32 Pixel Block (Y, Cb, Cr)
                         
                     
                  
                  
                     The filtering boundary of a deblocking filter can be either vertical or horizontal.
                     As shown in Fig. 3, the proposed architecture has 16 SRAM blocks. Fig. 4 shows the 32x32 block waiting for the Deblocking Filter operation in parallel. Each
                     named block is a 4x4 block. Our goal is to allocate these 4x4 blocks into the 16 SRAM
                     blocks such that no data conflicts occur between the filters during the parallel filter
                     computation. 
                     
                  
                  
                     As explained before, each SRAM block stores 8 4x4 pixel blocks. In Fig. 4, each block is named by 2 letters such as AA or BC. The block with the same first
                     letter belongs to the same vertical filtering. For example, In Fig. 4, blocks AE’s, AA, AB’s, AC’s and AD’s belong to the group for same vertical filtering,
                     and all of them have same first letter ‘A’. Fig. 13 is an example final data allocation. The blocks with initial ‘A’ are stored in the
                     first row of the SRAM blocks evenly spread across them without any overlaps. Therefore,
                     the vertical filtering of this data group can be performed by 8 filters accessing
                     different SRAM blocks without conflicts at the same time.
                     
                  
                  
                     Our proposed memory allocation algorithm shows a deterministic way to decide the SRAM
                     location by using two letters on each 4x4 block. The second letter of the block name
                     is used to denote the same horizontal filtering group.
                     
                  
                  
                     The filtering order of HEVC is (1) the vertical filtering shown in Fig. 5 done first, and then (2) the horizontal filtering shown in Fig. 6 next. For example, the blocks AA, AB, AC, AD and AE are used for same vertical filtering,
                     and BA, BB, BC, BD and BE are used for another vertical filtering as shown in Fig. 5. 
                     
                  
                  
                     
                     
                           
                           
Fig. 5. 32ⅹ32 Pixel Block - Vertical Filtering
                         
                     
                  
                  
                     
                     
                           
                           
Fig. 6. 32ⅹ32 Pixel Block - Horizontal Filtering
                         
                     
                  
                  
                     For horizontal filtering, AA, BA, CA, DA and EA belong to the same filtering group,
                     and AB, BB, CB, DB, and EB belong to another filtering group. Same filtering group
                     must be stored into different SRAM blocks to avoid the data access conflicts during
                     parallel filter computation. By applying these rules on the 32x32 pixel block using
                     8 alphabet letters as shown in Fig. 4, we can obtain the goal allocation as shown in Fig. 13 eventually. All the pixel blocks are stored into SRAM blocks such that no access
                     conflicts occur during the filter computation.
                     
                  
                  
                     
                     
                           
                           
Fig. 7. Memory Allocation Step 1
                         
                     
                  
                  
                     Fig. 7 ~ 12 illustrate a deterministic algorithm to reach a memory allocation result shown in
                     Fig. 13 without any access conflicts. Firstly, as shown in Fig. 7, allocate B-, C-, D- blocks with 4 space gaps between the rows. Note that all the
                     blocks are allocated into SRAM blocks (S_0~S_15) such that each SRAM has no conflicting
                     letters for first and second positions. Therefore there is no conflict in the SRAM
                     access. After this allocation, there are A- and E- blocks waiting for the allocation
                     as shown in Fig. 8.
                     
                  
                  
                     
                     
                           
                           
Fig. 8. Memory Allocation Step 2
                         
                     
                  
                  
                     
                     
                           
                           
Fig. 9. Memory Allocation Step 3
                         
                     
                  
                  
                     Secondly, as shown in Fig. 9, for the SRAM column with -A assigned like SRAM_1, put AE in the first row. For the
                     column with -E, put EA in the fifth row. As shown in Fig. 10, fill in the remaining letter on the column with only one empty slot.
                     
                  
                  
                     Finally, as shown in Fig. 11, fill in the remaining two rows with AA and EE for one column, and with AE and EA
                     for the other three columns. After all the luminance blocks are allocated, Cb and
                     Cr blocks are allocated similarly. As shown in Fig. 12, allocate the GG blocks in the SRAMs, and then allocate FH and HF blocks. The remaining
                     columns are allocated such that the first and second letters are not overlapped as
                     shown in Fig. 13.
                     
                  
                  
                     
                     
                           
                           
Fig. 10. Memory Allocation Step 4
                         
                     
                  
                  
                     
                     
                           
                           
Fig. 11. Memory Allocation Step 5
                         
                     
                  
                  
                     As long as the letters of each position are not overlapped in a column, there can
                     be many solutions. In this paper, we showed a deterministic algorithm to find a working
                     solution quickly.
                     
                  
                  
                     
                     
                           
                           
Fig. 12. Memory Allocation Step 6
                         
                     
                  
                  
                     By using the proposed memory allocation algorithm with an 8 filter architecture, the
                     execution time to perform filter oper- ations on 96 boundaries for a 32x32 pixel block
                     is 12 clock cycles. To add the 6 cycles of starting and ending delays of pipelines,
                     the total execution time is 18 clock cycles.
                     
                  
                  
                     
                     
                           
                           
Fig. 13. Memory Allocation Step 7
                         
                     
                  
                  
                     The memory allocation algorithm can be used for other numbers of filter parallelism
                     in a similar fashion. If the number of SRAM is increased, the number of parallel filters
                     can be increased proportionally. For example, 16 parallel filters can be used for
                     32 SRAM blocks.