Low-Cost Unified Pixel Converter from the MIPI DSI Packets into Arbitrary Pixel Sizes

Abstract: Advances in semiconductor and image processing technologies have significantly improved visual quality, especially on mobile consumer devices. These devices require a low-cost, high-bandwidth interface to support various pixel formats on high-resolution displays; thus, the MIPI Alliance has proposed the industry-standard MIPI DSI (Display Serial Interface). The traditional DSI Rx implementation classifies an incoming packet into three components (a header, a payload, and a checksum) by aligning the packet with the DSI PHY input width, and then converts the payload into pixels. This two-step approach results in high implementation costs when various pixel formats must be supported. This paper proposes a low-cost unified pixel converter that classifies each component and aligns the input payload into various pixel formats in a single step, thus reducing the area and power consumption overhead. Two terms are newly introduced for the proposal: a base and a remainder. The base size is the same as the DSI PHY input width, and the remainder is what is left of a pixel after its bases are aligned. The size of one pixel therefore equals the sum of one or more bases and the remainder. Because the base size exactly matches the D-PHY input width, the converter can be implemented in a very straightforward way. Additionally, our approach does not need to treat the header separately from the payload since the header size equals the base size. Therefore, the header detection unit is eliminated, reducing the complexity further. The proposed design was functionally verified on an FPGA and synthesized with the Samsung 65 nm standard cell library. The synthesis results show that the proposed design reduces the area by 25.7% and the power consumption by 38.6% compared with the traditional design.


Introduction
Advances in semiconductor and image processing technologies have significantly improved the visual quality of consumer devices [1,2]. In particular, the quality of mobile devices, such as smartphones and AR/VR (Augmented Reality and Virtual Reality) devices, is advancing rapidly. Most smartphone companies adopt megapixel cameras and high-resolution displays for generating and displaying high-quality images [2]. Similarly, AR/VR devices have high-resolution displays with refresh rates of 120 fps (Frames Per Second) or higher to reduce the discrepancy between what the eyes see and what the brain perceives.
Higher visual quality means that more data are generated by an image source, such as a camera, processed on computing devices, and delivered quickly to display devices, thus requiring high computational throughput and bandwidth. For example, a conventional FHD (1080p, 60 fps, 24-bit RGB) display requires a bandwidth of 3.58 Gbps, and a high-resolution UHD display (2160p, 60 fps) requires 14.33 Gbps, which is four times that of the conventional FHD [3]. More data movement implies more power consumption for the transmission; thus, it is essential to implement a power-efficient interface that supports

Background

MIPI Display Serial Interface
MIPI DSI [4] is a low-power and high-performance interface between a mobile device's application processor and a display module, established by the MIPI Alliance. The interface has been adopted in mobile-influenced market devices, such as smartphones and tablets [5], and recently in wearable devices, such as AR/VR devices [12].
A transaction in the interface consists of one or more packets. There are two types of packets: a 4-byte short packet for the interface configuration and a long packet with a variable length of 6∼65,541 bytes for video data transmission, as shown in Figure 1. The short packet consists of a 1-byte data identifier (DI) indicating the packet type, 2 bytes of data, and a 1-byte ECC for error correction. The long packet consists of a 4-byte header (a 1-byte DI, a 2-byte word count, and a 1-byte ECC, laid out like the short packet), a data payload of 0∼65,535 bytes, and a 2-byte footer carrying a checksum of the data payload. The word count indicates the payload size in bytes, and the payload carries the video data for a display. A short packet is used for sync timing, such as vsync (vertical sync) and hsync (horizontal sync), or for an R/W command for internal register control. Meanwhile, a long packet, which can carry a variable-length payload, is used for pixel stream data or vertical/horizontal blanking.
To output an image, a sync signal must be transmitted periodically in addition to the image data. For example, after one horizontal video line of a frame is transmitted, the hsync signal should be transmitted. To meet these requirements, the host composes and transmits a mixture of long packets for the pixel stream and short packets for the hsync.
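As a minimal sketch of the long-packet layout described above, the following Python splits a byte string into header fields, payload, and checksum. The function name and dictionary keys are hypothetical, the word count is assumed to be stored least-significant byte first, the DI and ECC values in the example are arbitrary placeholders, and no ECC or checksum verification is attempted.

```python
HEADER_SIZE = 4    # 1-byte DI + 2-byte word count + 1-byte ECC
CHECKSUM_SIZE = 2  # footer of a long packet

def parse_long_packet(buf: bytes) -> dict:
    """Split one long packet into its fields (no ECC or checksum checking)."""
    if len(buf) < HEADER_SIZE + CHECKSUM_SIZE:
        raise ValueError("a long packet is at least 6 bytes")
    # Assumption: the word count is transmitted least-significant byte first.
    word_count = int.from_bytes(buf[1:3], "little")
    total = HEADER_SIZE + word_count + CHECKSUM_SIZE
    if len(buf) < total:
        raise ValueError("buffer shorter than the length implied by the word count")
    return {
        "data_id": buf[0],
        "word_count": word_count,
        "ecc": buf[3],
        "payload": buf[HEADER_SIZE:HEADER_SIZE + word_count],
        "checksum": buf[HEADER_SIZE + word_count:total],
        "total_size": total,
    }

# A 9-byte payload gives a 15-byte packet: 4 (header) + 9 (payload) + 2 (checksum).
packet = bytes([0x1E, 9, 0, 0x00]) + bytes(range(9)) + bytes([0xAA, 0xBB])
fields = parse_long_packet(packet)
print(fields["word_count"], fields["total_size"])
```

The two extremes of the word count, 0 and 65,535, give total sizes of 6 and 65,541 bytes, which is the long-packet length range quoted in the text.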

MIPI DSI Standard Pixel Formats
The MIPI DSI standard defines several RGB and YCbCr pixel formats, and Table 1 shows them with their pixel and byte alignment information. As shown in the table, one pixel is not byte-aligned in some formats, such as 18-bit RGB and 36-bit RGB; thus, in these formats, multiple pixels need to be bundled together (multi-pixel-aligned data) to be delivered to a display module. Figure 2 represents a long-packet transmission for the packed 18-bit RGB pixel format, consisting of groups of 6-bit R, G, and B data for 18-bit pixels. The DSI Rx sends 4 pixels at a time to a display, i.e., 9 bytes (4 pixels × 18 bits/pixel ÷ 8 bits/byte). From now on, the multi-pixel-aligned data defined in Table 1 is referred to as a "pixel".
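The bundling arithmetic above can be sketched as follows. The helper name is hypothetical, and it only computes the smallest group of pixels that ends on a byte boundary; the groupings listed in Table 1 may differ for some formats (for example, loosely packed ones).

```python
from math import gcd

def byte_aligned_group(bits_per_pixel: int) -> tuple:
    """Return (pixels, bytes) for the smallest pixel group that ends on a byte boundary."""
    group_bits = bits_per_pixel * 8 // gcd(bits_per_pixel, 8)  # lcm(bits_per_pixel, 8)
    return group_bits // bits_per_pixel, group_bits // 8

for bpp in (18, 24, 30, 36):
    pixels, size = byte_aligned_group(bpp)
    print(f"{bpp}-bit: {pixels} pixel(s) in {size} byte(s)")
```

For the packed 18-bit format this reproduces the 4-pixel, 9-byte group from the text, and for 30-bit RGB it gives a 4-pixel, 15-byte group.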

Related Work
One related study addresses MPEG2 alignment [13]. In that study, compressed variable-length packets are processed in 1-byte or 4-byte units. Fixing the header parsing position through alignment is the same idea as in our study. However, the step that follows the alignment to the input lanes (alignment for audio or video data processing or for decompression) is not covered. Our research proposes a unified pixel converter that can perform the lane alignment and the pixel-format alignment of various sizes at once, significantly reducing the overhead compared to the existing structure, which requires two alignment steps or two separate processing units.
Among MIPI IPs, one IP aligns byte data to pixels for the CSI or DSI interface [14]. Depending on the number of lanes, this IP performs pixel conversion to the RGB666 and RGB888 formats for DSI and to the 8-bit, 10-bit, RAW12, and RGB888 formats for CSI. Our research targets all formats defined in the DSI standard, including 30-bit RGB, and implements the DSI IP with a low area overhead through an effective alignment method.

A Baseline Design
In the DSI protocol, a host composes one transaction by mixing 4-byte short packets and long packets with a variable length of up to 64 KB, as described in Section 2.1. When the current packet is a long packet with a variable-length payload, the next incoming packet can start at any byte position of the 4-byte D-PHY input. Therefore, to correctly recognize its header, payload, and checksum and to send the aligned pixels to a display, the start position of an incoming packet must be tracked carefully and updated dynamically according to the payload sizes of the previous packets.
A baseline design for processing the packet alignment is shown in Figure 3. It consists of three components: a packet aligner, a command decoder, and a pixel aligner. The packet aligner aligns the D-PHY inputs in 4-byte units. Then, using the aligned input, the command decoder decodes the header to recognize the pixel format of the payload and provides it to the pixel aligner, which rearranges the 4-byte-aligned payload into pixels for the display.
The packet aligner classifies an incoming packet into a 4-byte-aligned header, payload, and checksum. The aligner includes two 4-byte input buffers (cur_ibuf and prev_ibuf), two 4-byte output buffers (header_obuf and payload_obuf), and two select multiplexers (a header selector and a payload selector). At every cycle, the new incoming 4-byte data from the D-PHY is written to the current input buffer (cur_ibuf), and the 4-byte data previously stored there is moved to the previous input buffer (prev_ibuf), so that the two selectors process the input data of two cycles, i.e., 8 bytes. The header selector selects a 4-byte header from the 8 bytes of the two input buffers and stores it into the header output buffer (header_obuf), optionally enabling the ECC-correction module to perform error detection and correction before storing. The payload selector selects a 4-byte payload from the 8 bytes of the two input buffers and stores it into the payload output buffer (payload_obuf). Figure 4 shows how the payload selector chooses 4 bytes from the two input buffers: for the 4-byte alignment, the two buffers are right-shifted by the payload_pos value, the payload selector's selection value.
The header selector has the same design as the payload selector. The header_pos value, the header selector's selection value, is initially set to zero when a transaction starts (the transaction starts from lane zero according to the standard), and it changes according to the sizes of the previous packets. Since the length of one packet is the sum of the header, the payload, and the checksum, the header position of the next packet i + 1 can be expressed from that of the current packet i as follows, where l is the input size in bytes and s(·) represents the size of (·) in bytes:

$$\mathit{header\_pos}_{i+1} = \bigl(\mathit{header\_pos}_{i} + s(\mathit{header}_{i}) + s(\mathit{payload}_{i}) + s(\mathit{checksum}_{i})\bigr) \bmod l \tag{1}$$

Equation (1) can be shortened to Equation (2) since the header size is the same as the D-PHY input size, l, for all packets:

$$\mathit{header\_pos}_{i+1} = \bigl(\mathit{header\_pos}_{i} + s(\mathit{payload}_{i}) + s(\mathit{checksum}_{i})\bigr) \bmod l \tag{2}$$

Additionally, since the header occupies exactly l bytes, Equation (2) implies that payload_pos equals header_pos and does not change within the current packet i.
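The header position update described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' RTL: the function name and the `is_long` flag are hypothetical, and a short packet is treated as exactly one 4-byte word with no payload or checksum.

```python
LANES = 4          # l: D-PHY input width in bytes
CHECKSUM_SIZE = 2  # long-packet footer

def next_header_pos(header_pos: int, payload_size: int, is_long: bool) -> int:
    """Byte offset of the next packet's header inside the 4-byte D-PHY word."""
    if not is_long:
        # A short packet is exactly one D-PHY word, so the offset is unchanged.
        return header_pos
    # The 4-byte header is a multiple of LANES and drops out of the modulo.
    return (header_pos + payload_size + CHECKSUM_SIZE) % LANES

pos = 0                                  # a transaction starts at lane zero
pos = next_header_pos(pos, 9, True)      # long packet with a 9-byte payload
print(pos)
pos = next_header_pos(pos, 0, False)     # short packet, e.g. a sync event
print(pos)
```

With `LANES = 4` the modulo keeps only the two low bits of the sum, so a hardware version would need no more than a 2-bit adder.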
The pixel aligner consists of a demultiplexer (a pixel selector) and a 60-byte pixel output buffer, called pixel_obuf. The pixel selector places an aligned 4-byte input payload into pixel_obuf depending on the selector's selection value, pixel_obuf_pos, which starts from zero and increases by one until it reaches the value predefined for each pixel format. For example, if the pixel size is 12 bytes, pixel_obuf_pos takes values from zero to two. The format decoder provides the maximum value of pixel_obuf_pos, and the input payload is stored starting from the least significant 4 bytes of the output buffer.
Because pixel_obuf has to be aligned to both the pixel sizes (multi-pixel-aligned sizes) shown in Table 1 and the input payload unit (4 bytes), we used 60 bytes as the output buffer size, which is the least common multiple for the worst case. This worst case is the 15-byte multi-pixel alignment size of the 30-bit RGB format, for which pixel_obuf_pos repeats from 0 to 14. The worst-case design therefore makes the pixel aligner complex, since it requires a large output buffer and a large pixel selector.
The baseline design has two shortcomings: (1) Two independent alignments are required, incurring a significant overhead. (2) Each pixel format requires a different pixel alignment and thus a circuit dedicated to that format, which costs the design its flexibility to support multiple pixel formats. To retain that flexibility, the pixel aligner must be dimensioned for the largest pixel alignment, i.e., the worst case [14].
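The buffer sizing argument can be checked with a few lines of arithmetic. The sketch below is only illustrative: the helper name is hypothetical, and the list of pixel sizes (6, 9, 12, and 15 bytes) is taken from sizes mentioned in the text rather than from a complete copy of Table 1.

```python
from math import gcd

LANES = 4  # input payload unit in bytes

def baseline_obuf(pixel_size: int) -> tuple:
    """Return (buffer bytes, number of 4-byte positions) the baseline needs for one format."""
    size = pixel_size * LANES // gcd(pixel_size, LANES)  # lcm(pixel_size, LANES)
    return size, size // LANES

for pixel_size in (6, 9, 12, 15):
    size, positions = baseline_obuf(pixel_size)
    print(f"{pixel_size}-byte pixel: {size}-byte buffer, pixel_obuf_pos 0..{positions - 1}")

# The buffer has to cover the worst case among the supported formats.
print(max(baseline_obuf(s)[0] for s in (6, 9, 12, 15)))
```

The 12-byte case yields positions 0 to 2 and the 15-byte case yields positions 0 to 14 in a 60-byte buffer, matching the values quoted above.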

Overall Design
The baseline design classifies one packet into a header, a payload, and a checksum; our unified pixel converter instead classifies it into one or more bases and a remainder. The interface input size determines the base size (here, the number of D-PHY lanes, l, i.e., four bytes) and bounds the remainder size, α (between zero and l − 1 bytes). In this way, we can represent any data size as l × n + α bytes, where n ≥ 1 and 0 ≤ α < l. Therefore, our design can support any size of pixel format; we only need to increase the value of n to support larger pixels. We refer to the 4n bytes as n 4-byte bases and to α as the remainder. For example, a 15-byte pixel consists of three 4-byte bases and a 3-byte remainder.
Our base and remainder classification allows easy pixel alignment, for which a much smaller pixel output buffer than that of the baseline design is sufficient. In addition, because the header is also 4 bytes, the same size as the base, and the payload follows the header, one base selector can handle both the header and the payload bases. This characteristic eliminates the separate alignment logic for the header selection.
Figure 5 shows our unified pixel converter. There are two input buffers, cur_ibuf and prev_ibuf, to store inputs from the D-PHY for two cycles, the same as in the baseline. Two output buffers, a header output buffer (header_obuf) and a pixel output buffer (pixel_obuf), store a header and a pixel, respectively. The size of the pixel output buffer ranges from 4 to 4n + 3 bytes, depending on the maximum pixel size to be aligned. In this paper, we set the size to 15 bytes, the largest pixel size covered by the standard as shown in Table 1, which is much less than the 60 bytes of the baseline design. Additionally, there are three selectors: the base selector selects the header and the 4-byte bases (4n) from the two input buffers, the remainder selector selects the remainder (α), and the base-pixel selector puts the selected base into the proper position of pixel_obuf.
Even when we need to support pixels larger than 15 bytes, our design retains its advantage over the baseline: the alignment unit does not need to be modified, and only the output buffer must grow to the size of the largest pixel. Suppose we want to support a pixel set containing multiple prime-number-sized pixels. Our design needs only a buffer of the largest pixel size regardless of the set's configuration. The baseline, on the other hand, needs an output buffer as large as the least common multiple of the sizes, which is much larger than the maximum-sized pixel, as well as larger pixel-selector multiplexers.
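The decomposition into bases and a remainder is a plain integer division, as the following sketch shows. The function name is hypothetical, and the example sizes are those mentioned elsewhere in the text.

```python
def decompose(pixel_size: int, lanes: int = 4) -> tuple:
    """Split a pixel size into n bases of `lanes` bytes and a remainder alpha."""
    n, alpha = divmod(pixel_size, lanes)
    if n < 1:
        raise ValueError("the scheme assumes at least one base (n >= 1)")
    return n, alpha

for size in (6, 9, 12, 15):
    n, alpha = decompose(size)
    print(f"{size} bytes = {n} x 4-byte base + {alpha}-byte remainder")
```

A 15-byte pixel decomposes into n = 3 and α = 3, the example given above.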

Selectors for the Payload and the Pixel Alignments
This section explains how the base, the remainder, and the base-pixel selectors work with their position values. Figure 6 shows the 15-byte pixel output buffer, pixel_obuf, storing three aligned bases and one remainder. To simplify our implementation, we align the three most significant bytes of the output pixel buffer with the cur_ibuf buffer for the remainder and the rest of the output pixel buffer with the prev_ibuf buffer for one or more bases. The base-pixel selector, a demultiplexer, stores the bases into pixel_obuf according to the pixel_obuf_pos value. Since the remainder is aligned in cur_ibuf, the remainder goes directly to the most significant 3 bytes of the pixel output buffer, i.e., pixel_obuf[14:12], thus not requiring any selector for the remainder. This is our unique optimization: it eliminates the multiplexer that would otherwise steer the remainder into the output buffer. Depending on α, i.e., the number of remainder bytes, a display engine can ignore some of the three remainder bytes. The bases are aligned with respect to the remainder location, as shown in the figure. Therefore, pixel_obuf_pos becomes i + (3 − n) for the ith base, where 0 ≤ i < n. The constant three is the number of bases that can be stored in the buffer.
Figure 7. Example of two input buffer operations when n = 2 and α = 2. The gray bytes represent a current payload. (a) A simple case in which all the remainder payload bytes are available in cur_ibuf. (b) A difficult case in which not all the remainder payload bytes are available in cur_ibuf.
Figure 7b shows a difficult case in which not all the remainder payload bytes are available in cur_ibuf at cycle t. We select the second base, b7 b6 b5 b4, at cycle t and the remainder at the next cycle, t + 1. At cycle t + 1, the least significant bytes of the remainder are located in prev_ibuf; thus, we need a right-shift by base_pos − l to select the remainder. A negative number means a left-shift. N/A in the position means that the value of the corresponding selector is not used.
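The two cases can be mimicked with list slicing over the 8-byte view formed by prev_ibuf (low bytes) and cur_ibuf (high bytes). This is a behavioural sketch, not the authors' hardware: the function names are hypothetical, and since the figure itself is not reproduced here, the base_pos values (1 for the simple case, 3 for the difficult one) are assumptions chosen to trigger each case for n = 2 and α = 2. Payload bytes are numbered 0 to 9, so the second base is bytes 4 to 7 and the remainder is bytes 8 and 9.

```python
LANES = 4  # l: width of cur_ibuf / prev_ibuf and of one base, in bytes

def select_base(prev_ibuf, cur_ibuf, base_pos):
    """Right-shift {cur_ibuf, prev_ibuf} by base_pos bytes and keep the low 4 bytes."""
    window = list(prev_ibuf) + list(cur_ibuf)  # index 0 = least significant byte of prev_ibuf
    return window[base_pos:base_pos + LANES]

def select_remainder(prev_ibuf, cur_ibuf, rem_pos, alpha):
    """Shift by rem_pos bytes (negative = left shift) and read alpha bytes aligned with cur_ibuf."""
    window = list(prev_ibuf) + list(cur_ibuf)
    start = LANES + rem_pos
    return window[start:start + alpha]

X = 0xFF  # placeholder for bytes that do not belong to the current pixel

# Simple case (base_pos = 1): the second base and the whole remainder are visible at once.
prev, cur = [3, 4, 5, 6], [7, 8, 9, X]
print(select_base(prev, cur, 1), select_remainder(prev, cur, 1, 2))

# Difficult case (base_pos = 3): byte 9 has not arrived when the second base is selected.
prev, cur = [1, 2, 3, 4], [5, 6, 7, 8]
print(select_base(prev, cur, 3))                    # cycle t
prev, cur = cur, [9, X, X, X]
print(select_remainder(prev, cur, 3 - LANES, 2))    # cycle t + 1, rem_pos = -1
```

In the difficult case the shift amount base_pos − l is −1, i.e., a left shift by one byte, which brings byte 8 from prev_ibuf up into the cur_ibuf-aligned remainder slot.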

Calculation of Selector Positions
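The base position for the kth base of the jth pixel in the ith packet of a transaction, with j and k counted from zero, is expressed as follows using Equation (2):

$$\mathit{base\_pos}_{i,j,k} = \bigl(\mathit{header\_pos}_{i} + s(\mathit{header}_{i}) + j\,(l \times n + \alpha) + k\,l\bigr) \bmod l = \bigl(\mathit{header\_pos}_{i} + j\,\alpha\bigr) \bmod l \tag{3}$$

From Equation (3), we know that base_pos does not change within a pixel but increases by the remainder, α, for the next pixel. base_pos can be implemented simply with a 2-bit adder because only the lower 2 bits of the result need to be kept when l = 4.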
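We already discussed how the remainder selector is controlled in Figure 7. The difficult case corresponds to base_pos + α > l. Therefore, the selection value for the remainder selector, rem_pos, becomes:

$$\mathit{rem\_pos} = \begin{cases} \mathit{base\_pos}, & \mathit{base\_pos} + \alpha \le l \\ \mathit{base\_pos} - l, & \mathit{base\_pos} + \alpha > l \end{cases} \tag{5}$$

We calculate base_pos + α for every pixel anyway, so this comparison does not require any additional resource.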
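The position arithmetic can be sketched as below. The function names and the returned `extra_cycle` flag are hypothetical, and the piecewise rule for rem_pos is inferred from the discussion of Figure 7 (same shift as the base in the simple case, base_pos − l one cycle later in the difficult case), so it should be read as an assumption rather than as the authors' exact formulation.

```python
LANES = 4  # l

def next_base_pos(base_pos: int, alpha: int) -> int:
    """base_pos of the next pixel: only the low 2 bits of the sum are kept when l = 4."""
    return (base_pos + alpha) & (LANES - 1)

def rem_pos(base_pos: int, alpha: int) -> tuple:
    """Return (shift for the remainder selector, whether an extra cycle is needed)."""
    if base_pos + alpha > LANES:        # difficult case: remainder not yet in cur_ibuf
        return base_pos - LANES, True   # negative value means a left shift
    return base_pos, False              # simple case: same shift as the base

pos = 0
for pixel in range(4):                  # four consecutive 15-byte pixels (alpha = 3)
    print(pixel, pos, rem_pos(pos, 3))
    pos = next_base_pos(pos, 3)
```

For α = 3 the base position cycles through 0, 3, 2, 1, and the second and third pixels fall into the difficult case.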
For the ith base, pixel_obuf_pos becomes i + (3 − n), where 0 ≤ i < n. The value of n is constant for a given pixel format, so the counter for pixel_obuf_pos is initialized to 3 − n for the first base. Figure 8 shows the very first portion of a packet in which 15-byte pixels are transmitted continuously. It represents the operation of each selector and of pixel_obuf cycle by cycle when two pixels are transmitted consecutively. Since the pixel size is 15 bytes (n = 3, α = 3), we need to select a 4-byte base three times and a 3-byte remainder once to fill pixel_obuf.
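To make the counter concrete, the sketch below lists the pixel_obuf_pos values and the byte ranges of the 15-byte pixel_obuf they address for n = 1, 2, and 3. The helper names are hypothetical, and the sketch only covers the 15-byte buffer used in this paper.

```python
LANES = 4
MAX_BASES = 3  # the 15-byte pixel_obuf holds three bases below the 3-byte remainder

def base_slots(n: int) -> list:
    """pixel_obuf_pos of each base: a counter that starts at 3 - n and ends at 2."""
    if not 1 <= n <= MAX_BASES:
        raise ValueError("this sketch only covers 1 to 3 bases")
    return [i + (MAX_BASES - n) for i in range(n)]

def slot_bytes(pos: int) -> tuple:
    """(high, low) byte indices of pixel_obuf written for a given pixel_obuf_pos."""
    return LANES * pos + LANES - 1, LANES * pos

for n in (1, 2, 3):
    print(n, [(pos, slot_bytes(pos)) for pos in base_slots(n)])
```

Whatever the value of n, the last base ends at byte 11, directly below the remainder at pixel_obuf[14:12], so the valid bytes of a pixel stay contiguous.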

Pixel Alignment Example
As described in Section 2.2, the DSI IP supports various pixel formats, so the output pixel size also varies. Therefore, the pixels output from the DSI IP should be aligned to a specific fixed point; for example, they can be aligned to either the least significant byte or the most significant byte of the output buffer. In this paper, we fix the remainder position for the simplicity of the hardware. Fixing the position of the remainder and adjusting the start position of the bases has the advantage that no additional multiplexer is required for the remainder when processing various pixel sizes. If necessary, a base-pixel selector that aligns pixels from the least significant byte is possible with an additional large multiplexer for the remainder. That multiplexer stores the remainders of the 6-byte, 9-byte, and 15-byte multi-pixels in Table 1 in [5:4], [8], and [14:12] of the output buffer, respectively.
In the example of Figure 8, base_pos is initially set to zero. After that, base_pos is incremented by three for every pixel (because α = 3) according to Equation (3). At cycle one, base_pos is zero, and the selected 4-byte base is shown in gray. The selected base fills pixel_obuf according to pixel_obuf_pos, which increases by one from 3 − n to 2 and repeats for every pixel. The remainder is selected at the same time as the last base at cycle three and fills pixel_obuf; here, rem_pos has the same value as base_pos by Equation (5). The first pixel is thus the simple case, where the remainder fills pixel_obuf in the same cycle as the last base since the remainder is available in cur_ibuf. The second pixel is the difficult case, where the remainder fills pixel_obuf in the cycle after the last base since the remainder is not yet available in cur_ibuf.
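The walk-through above can be reproduced with a small behavioural model. This is a sketch under several assumptions and not the authors' RTL: the payload is assumed to start on a word boundary (base_pos = 0) with no header modelled, the function name and log format are hypothetical, and the extra idle cycle inserted when a pixel ends exactly on a word boundary is a modelling choice that the text does not describe.

```python
LANES = 4                        # l: D-PHY input width and base size, in bytes
MAX_BASES = 3                    # bases that fit in the 15-byte pixel_obuf
REM_OFFSET = LANES * MAX_BASES   # the remainder always lands at pixel_obuf[14:12]

def convert(stream, n, alpha, num_pixels):
    """Rebuild num_pixels pixels of 4n + alpha bytes (1 <= n <= 3) from 4-byte words."""
    padded = list(stream) + [0] * (2 * LANES - len(stream) % LANES)
    words = [padded[i:i + LANES] for i in range(0, len(padded), LANES)]

    def window(cycle):
        # prev_ibuf holds the previous word (low bytes), cur_ibuf the newest one.
        return words[cycle - 1] + words[cycle]

    base_pos, cycle, pixels, log = 0, 1, [], []
    for _ in range(num_pixels):
        obuf = [0] * (REM_OFFSET + LANES - 1)           # 15-byte pixel_obuf
        for k in range(n):
            slot = k + (MAX_BASES - n)                  # pixel_obuf_pos
            obuf[LANES * slot:LANES * (slot + 1)] = window(cycle)[base_pos:base_pos + LANES]
            log.append((cycle, "base", base_pos))
            cycle += 1
        if base_pos + alpha <= LANES:                   # simple case: with the last base
            rem_cycle, rem_pos = cycle - 1, base_pos
        else:                                           # difficult case: one cycle later
            rem_cycle, rem_pos = cycle, base_pos - LANES
            cycle += 1
        start = LANES + rem_pos
        obuf[REM_OFFSET:REM_OFFSET + alpha] = window(rem_cycle)[start:start + alpha]
        log.append((rem_cycle, "rem", rem_pos))
        if base_pos + alpha == LANES:
            cycle += 1                                  # assumption: wait for the next word
        base_pos = (base_pos + alpha) % LANES
        pixels.append(obuf[LANES * (MAX_BASES - n):REM_OFFSET + alpha])
    return pixels, log

pixels, log = convert(range(30), n=3, alpha=3, num_pixels=2)
for entry in log:
    print(entry)
```

For two 15-byte pixels, the log shows the first remainder taken at cycle three together with the third base (rem_pos = 0) and the second remainder taken one cycle after its last base with rem_pos = −1, in line with the simple and difficult cases described above.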

Design Verification and Performance Analysis
This section presents the functional verification of the proposed design on an FPGA and the synthesis results obtained with the 65 nm standard library, comparing the proposed design with the baseline design in terms of area and power consumption.

FPGA-Based Design Verification and Its Resource Usage
The experimental environment for the FPGA verification is shown in Figure 9a. We configured the environment by connecting the D-PHY signal generator from DGnT [15] and a display to the Xilinx Zynq UltraScale+ MPSoC evaluation platform [16]. On the FPGA, we implemented the Xilinx D-PHY Rx IP [17], which converts the signal of the D-PHY signal generator into a DSI input; the DSI Rx IP, to which the proposed design is applied; and an HDMI output IP, which displays the pixels from the payload on the display. The signal generator generated a test pattern of 1080p@60 fps, and the DSI IP operated at 112 MHz. We tested crosshatch, checkerboard, and horizontal/vertical RGB patterns. Figure 9b shows the horizontal RGB pattern on the HDMI monitor.
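As a quick plausibility check of the operating frequency, the sketch below computes the input throughput under the assumption that the DSI IP accepts one 4-byte D-PHY word per clock cycle; the function name is hypothetical.

```python
def link_gbps(clock_mhz: float, bytes_per_cycle: int = 4) -> float:
    """Input throughput in Gbps, assuming one D-PHY word is accepted per clock cycle."""
    return clock_mhz * 1e6 * bytes_per_cycle * 8 / 1e9

print(f"{link_gbps(112):.3f} Gbps")
```

Under that assumption, 112 MHz corresponds to about 3.58 Gbps, which lines up with the FHD bandwidth quoted in the Introduction.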

Synthesis Result
We performed the synthesis for the area and power analysis using the 65 nm standard library at an operating voltage of 1.08 V and a toggle rate of 10%. The operating frequency was 400 MHz for both designs, which is sufficient to support up to UHD (3840 × 2160)@60 fps. Furthermore, in the case of high-resolution displays, multiple Rx IPs are often used by dividing the display into areas [18], so there is no obstacle to applying our proposal to commercial products. Table 2 compares the area of the DSI IP of the baseline with that of the proposed design. Both used the same LP (low power) Tx and Rx modules for the synthesis, which set the configuration registers with the short packet. HS (high speed) Rx represents the baseline design in Figure 3 and our unified pixel converter in Figure 5. The proposed design showed a significant area reduction compared to the baseline: 34.2% in the HS Rx and 25.7% in total. Table 3 compares the major components of each design, explaining the result in Table 2. Both designs can be divided into three main parts: two input buffers, two selectors, and the output alignment logic with multiplexers and buffers. Our design significantly reduced the output alignment logic owing to the small output buffer and the small multiplexer for pixel alignment.
Table 3. Component configuration.

Conclusions
As the amount of data transferred to display peripherals in a mobile environment increases, an interface with high bandwidth and low power consumption becomes essential. The MIPI Alliance has proposed the industry-standard MIPI DSI for this purpose. A display should support as many pixel formats as possible for the user's flexibility, but this results in implementation overhead for converting the packets into pixels.
In this paper, we proposed a low-cost unified pixel converter that reduces the complexity of the traditional design by classifying packets into a base and a remainder instead of a header and a payload. This concept dramatically simplifies the alignment process for the pixel output, making it possible to create DSI IPs with a low hardware overhead while supporting various pixel formats. Our unified pixel-converter design can support various input sizes and output sizes through the base and remainder concept. Therefore, our approach can be further extended to many other interfaces with a different number of lanes (e.g., 1 to 32 lanes on PCIe [19]) or to standards with different data sizes to be aligned (e.g., the 188-byte fixed-length packets used by the MPEG-2 Transport Stream [20]). The proposed DSI IP was functionally verified on an FPGA and synthesized with the Samsung 65 nm process standard library [21] and Synopsys Design Compiler [22], resulting in reductions of 25.7% in area and 38.6% in power consumption.