Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rockchip64-edge: add rkvdec2 for rk356x #6804

Merged
merged 5 commits into from
Jul 28, 2024

Conversation

amazingfate
Copy link
Contributor

Description

We can also use rkvdec2 on rk356x.

  • Remove all non-coherent-dst-bufs patches because they are suggested by a chromium developer when using libyuvimageprocessor to do NV12->AR24 convert. Now we use wayland gbm to render NV12 directly, so these patches are not necessary.
  • Add rkvdec2 nodes to devicetree of rk356x.
  • Enable rkvdec2 in kernel config and change CMA size from 128M to 384M for chromium's 4K h264 decoding.
  • Change hantro g1 vpu compatible string from rockchip,rk3568-vpu to rockchip,rk3328-vpu because they have exactly the same vpu variant data.
  • Disable h264 decoder of rk3328-vpu because there are aleady rkvdec/rkvdec2 h264 decoders on rk3328 and rk356x. Multi H264 decoders will make gstreamer and chromium confused.
  • Add chromium v4l2 decoder udev rule for rk3328-vpu

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Please also note any relevant details for your test configuration.

Checklist:

Please delete options that are not relevant.

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

@github-actions github-actions bot added size/medium PR with more then 50 and less then 250 lines Hardware Hardware related like kernel, U-Boot, ... Framework Framework components Patches Patches related to kernel, U-Boot, ... labels Jun 26, 2024
@paolosabatino
Copy link
Contributor

I wonder why such big CMA windows is needed for rkvdec2; I mean, rkvdec v1 AFAIK does not require CMA because the device has its own MMU and can freely access kernel memory. I don't know if something changed for rkvdec2, but such a regression would surely would disappoint me.

I can decode 4K content on 32 bit rk3228 with 1Gb devices and just 16Mb CMA window.

CMA turns as "reserved" memory to the linux kernel and therefore it cannot be used by userland processes. Since the rockchip64 kernel also applies to smaller rk3328 devices, reserving 384M for those would hamper them.

As said, everything could be changed for rkvdec2, but please verify if it is really needed.

@amazingfate
Copy link
Contributor Author

I wonder why such big CMA windows is needed for rkvdec2; I mean, rkvdec v1 AFAIK does not require CMA because the device has its own MMU and can freely access kernel memory. I don't know if something changed for rkvdec2, but such a regression would surely would disappoint me.

I can decode 4K content on 32 bit rk3228 with 1Gb devices and just 16Mb CMA window.

CMA turns as "reserved" memory to the linux kernel and therefore it cannot be used by userland processes. Since the rockchip64 kernel also applies to smaller rk3328 devices, reserving 384M for those would hamper them.

As said, everything could be changed for rkvdec2, but please verify if it is really needed.

This 384M CMA size is copied from rk3588's kernel config. At first I did not touch it but when playing 4K video with chromium there was cma not enough error from dmesg, and chromium failed back to software decoding, so I just make CMA size same as rk3588's kernel.

Maybe 384M CMA size is not the minimal requirement, I can check it. But if I recall it correctly the upcoming rk3588's hdmirx driver also need large CMA size: https://github.com/armbian/linux-rockchip/blob/rk-5.10-rkr6/arch/arm64/boot/dts/rockchip/rk3588-rock-5b.dts#L210, I need 384M for 4K hdmi in capture with downstream kernel.

If we have a plan to merge rk3588 family to rockchip64 eventually, we have to increase the CMA size.

@ColorfulRhino
Copy link
Collaborator

ColorfulRhino commented Jun 26, 2024

CMA turns as "reserved" memory to the linux kernel and therefore it cannot be used by userland processes. Since the rockchip64 kernel also applies to smaller rk3328 devices, reserving 384M for those would hamper them.

Yeah this is what I was wondering as well. I don't know enough about this to know if this will "steal" 384MB of RAM to reserve for the VPU or GPU. But if that's the case, that would be a big deal.

I mean, just look at the Raspberry Pi kernel disabling an otherwise good/positive kernel config option to fix an issue that has already been fixed in mainline, just to save 1MB of RAM: raspberrypi/linux#5975 (comment)

If there are Rockchip64 devices with only like 2GB of RAM, 384MB is a huge chunk.

Maybe this issue should be reported to mainline or the people developing rkvdec2. If they are already aware, maybe we should wait until it's fixed?

@SteeManMI
Copy link
Contributor

From what I've read the CMA pool can be set in the device tree, as a kernel boot parameter or in the kernel config (and the order of presidince is in that order). So it seems that there are some options on how and where this could be implemented so that it can vary by board, so low memory devices could go with a lower value.

@amazingfate
Copy link
Contributor Author

Maybe 384M CMA size is not the minimal requirement, I can check it.

From my test 192M cma is enough for rkvdec2, and 184M is not enough.

From what I've read the CMA pool can be set in the device tree, as a kernel boot parameter or in the kernel config (and the order of presidince is in that order). So it seems that there are some options on how and where this could be implemented so that it can vary by board, so low memory devices could go with a lower value.

I tried to add 192M reserved cma to devicetree with this overlay:

/dts-v1/;
/plugin/;
/ {
	fragment@0 {
		target-path = "/";
		__overlay__ {
			reserved-memory {
				#address-cells = <2>;
				#size-cells = <2>;
				ranges;

				cma {
					compatible = "shared-dma-pool";
					reusable;
					size = <0x0 (192 * 0x100000)>;
				};
			};
		};
	};
};

But it seems that this just add an additional 192M cma memory:

$ dmesg |grep cma
[    0.000000] OF: reserved mem: initialized node cma, compatible id shared-dma-pool
[    0.000000] OF: reserved mem: 0x00000001f4000000..0x00000001ffffffff (196608 KiB) map reusable cma
[    0.000000] cma: Reserved 184 MiB at 0x00000000dc000000 on node -1
[    0.000000] Memory: 3608232K/4192256K available (17280K kernel code, 2378K rwdata, 5416K rodata, 4864K init, 622K bss, 199000K reserved, 385024K cma-reserved)
[   98.348794] cma: cma_alloc: reserved: alloc failed, req-size: 4050 pages, ret: -12
[   98.349502] cma: number of available pages: 120@2056+46@23762+46@27858+46@31954+46@36050+46@40146+2862@44242=> 3212 free of 47104 total pages

184 MiB cma is set in kernel config, and total cma memory size 385024K(376M) is 192M + 184M.
And chromium/gstreamer still failed to use rkvdec2 decoder.

@paolosabatino
Copy link
Contributor

@amazingfate did you try to reduce the size of the cma configured in the kernel to a small number (say 16 or 32 megabytes) or even disable it and declare a large cma (384M) in the device tree to see what happens? The intent of the cma is to be linearly addressable as well, if you have more smaller chunks it could be that the client can't allocate a large single chunk.

@amazingfate
Copy link
Contributor Author

While setting cma size with kernel cmdline works. And cma size 256M is okay for both gstreamer and chromium.

@amazingfate did you try to reduce the size of the cma configured in the kernel to a small number (say 16 or 32 megabytes) or even disable it and declare a large cma (384M) in the device tree to see what happens? The intent of the cma is to be linearly addressable as well, if you have more smaller chunks it could be that the client can't allocate a large single chunk.

After setting cma size to 0 in kernel config, gstreamer will not get cma even there is 256M cma configured in device tree. It seems that v4l2 driver will always get cma set from kernel config or cmdline.

@ColorfulRhino
Copy link
Collaborator

ColorfulRhino commented Jun 28, 2024

I wonder how mainline does/will handle this CMA size 😄

@amazingfate
Copy link
Contributor Author

I'm thinking of splitting bootenv for rk35xx socs to use larger cma size so old socs like rk3328 can still play with 128M reserved cma.

@igorpecovnik igorpecovnik added the Work in progress Unfinished / work in progress label Jun 30, 2024
@paolosabatino
Copy link
Contributor

After setting cma size to 0 in kernel config, gstreamer will not get cma even there is 256M cma configured in device tree. It seems that v4l2 driver will always get cma set from kernel config or cmdline.

I wonder why they are treated differently.
In the meantime, isn't easier to add a rk356x.txt in config/bootenv to include the extraargs=cma=384M , rather than splitting the boot script?

@amazingfate
Copy link
Contributor Author

After setting cma size to 0 in kernel config, gstreamer will not get cma even there is 256M cma configured in device tree. It seems that v4l2 driver will always get cma set from kernel config or cmdline.

I wonder why they are treated differently. In the meantime, isn't easier to add a rk356x.txt in config/bootenv to include the extraargs=cma=384M , rather than splitting the boot script?

This is indeed a better way. I will try that.

@paolosabatino
Copy link
Contributor

Just for knowledge, but on my rk3318 box (which is totally similar to rk3328) I can achieve hardware video decoding with CMA completely disabled using extraargs=cma=0 in /boot/armbianEnv.txt:

root@rk3318-box:~# cat /proc/meminfo | grep -i cma
CmaTotal:              0 kB
CmaFree:               0 kB

This is the mpv process and its threads cpu usage via top while streaming 1080p h.264 content from network:

top - 15:39:42 up 7 min,  2 users,  load average: 1.17, 0.88, 0.45
Threads:  11 total,   1 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.4 us,  6.9 sy,  0.0 ni, 85.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
MiB Mem :   1979.4 total,   1313.3 free,    457.7 used,    325.8 buff/cache     
MiB Swap:    989.7 total,    989.7 free,      0.0 used.   1521.7 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                     
   1611 paolo     20   0 1198072 275760 118144 S  51.5  13.6   3:24.09 mpv                                                                                                                                                         
   1622 paolo     20   0 1198072 275760 118144 R  23.3  13.6   1:25.57 mpv/vo                                                                                                                                                      
   1613 paolo     20   0 1198072 275760 118144 S   6.3  13.6   0:22.83 mpv/lua script                                                                                                                                              
   1626 paolo     20   0 1198072 275760 118144 S   3.3  13.6   0:11.66 mpv/ao                                                                                                                                                      
   1619 paolo     20   0 1198072 275760 118144 S   0.3  13.6   0:11.72 mpv/demux                                                                                                                                                   
   1612 paolo     20   0 1198072 275760 118144 S   0.0  13.6   0:00.07 mpv/terminal                                                                                                                                                
   1614 paolo     20   0 1198072 275760 118144 S   0.0  13.6   0:00.02 mpv/lua script                                                                                                                                              
   1615 paolo     20   0 1198072 275760 118144 S   0.0  13.6   0:00.03 mpv/lua script                                                                                                                                              
   1616 paolo     20   0 1198072 275760 118144 S   0.0  13.6   0:00.10 mpv/lua script                                                                                                                                              
   1620 paolo     20   0 1198072 275760 118144 S   0.0  13.6   0:00.00 mpv/worker                                                                                                                                                  
   1623 paolo     39  19 1198072 275760 118144 S   0.0  13.6   0:00.00 mpv:disk$0

and content is playing smooth, just with some occasional dropped frames while I tinker in the background.

And here is dmesg:
dmesg.colored.log.gz

@paolosabatino
Copy link
Contributor

paolosabatino commented Jul 1, 2024

Found also another interesting thing about the device tree: the cma node must be specified with reusable property and without no-map property to be shown as cma-reserved in the memory summary.
Only this kind of node:


/ {

    ...


    reserved-memory {

                #address-cells = <2>;
                #size-cells = <2>;
                ranges;

                cma-reserved {
                        compatible = "shared-dma-pool";
                        reusable;
                        size = <0x0 0x4000000>;
                        alignment = <0x0 0x2000>;
                        linux,cma-default;
                        linux,dma-default;
                };

        };

    ....

};

Actually gives me 64mb of CMA memory:

root@rk3318-box:/boot/dtb/rockchip# grep -i cma /proc/meminfo
CmaTotal:          65536 kB
CmaFree:           56816 kB

If reusable property is missing, then I get 0 kB in CmaTotal row (although the reserved memory account in dmesg increases)

@amazingfate you may give a try to this

@Tonymac32
Copy link
Member

TBH rockchip64 should be split into RK33 and RK35 so these interactions aren't problematic. Even that split is sill since RK3308 and RK3399 have nothing in common, but at least the mad scientist lab can be kept far away from stable hardware

@paolosabatino
Copy link
Contributor

TBH rockchip64 should be split into RK33 and RK35 so these interactions aren't problematic. Even that split is sill since RK3308 and RK3399 have nothing in common, but at least the mad scientist lab can be kept far away from stable hardware

I wouldn't do that, not after struggling to keep the number of kernels under control... I mean, if you look at allwinner, 32 bits boards and 64 bits boards happily live together with the same patchset (with tons of patches, though) and just two kernels.

The intent should be to merge things into current/edge kernels when the hardware is actually ready for mainline; mad scientist things should live in the mad scientists' own repository or branch; I see the comments above as "what to do to coexist", and the confrontation looks sane to me.

It would ideal if people asks themselves "what other board or system I can break with this change?" (ie. take responsibility) before bringing in random patches that makes its hardware "just work".

@amazingfate
Copy link
Contributor Author

Found also another interesting thing about the device tree: the cma node must be specified with reusable property and without >no-map property to be shown as cma-reserved in the memory summary.

This method works! So I can just add this reserved memory node to rk356x.dtsi and no need to change the CMA size in kernel config.
Sorry for late reply and I'm going to continue with this pr.

@amazingfate
Copy link
Contributor Author

I find that CmaFree size is different when cma is set with kernel arg cma=256M and devicetree node:
Set with kernel arg cma=256M:

$ grep -i cma /proc/meminfo
CmaTotal:         262144 kB
CmaFree:          252384 kB

Set with devicetree node:

$ grep -i cma /proc/meminfo
CmaTotal:         262144 kB
CmaFree:           25872 kB

When cma is set to 256M with devicetree, rkvdec2 driver will still failed with dma alloc of size 16588800 failed.

@paolosabatino
Copy link
Contributor

paolosabatino commented Jul 16, 2024

@amazingfate there is nothing to excuse about, on the contrary I find all of this very interesting. I wonder why rkvdec2 complaints and why CmaFree is different when set via device tree.

Reading the (old) LWA article about CMA, it says that it is possible for the kernel to use the CMA-allocated memory for other uses like disk cache, but when a device driver requires it, the disk cache is promptly erased and the memory is given to the device driver that requires the chunk.

During the above tests with the rk3328, there were also situations were CMA free memory lowered after playing some videos and toying around, and it didn't get reclaimed by the kernel after closing the supposed users, so I still don't understand those CmaTotal and CmaFree values.

@paolosabatino
Copy link
Contributor

@amazingfate While reviewing the rockchip64 kernel config, I stumbled upon these "system heap" and "CMA heap" config options:

image

Maybe it is worth enabling them and give it a shot? I know that have been enabled on libreelec for other rockchip platforms to let the video decoding work.

@amazingfate
Copy link
Contributor Author

@amazingfate While reviewing the rockchip64 kernel config, I stumbled upon these "system heap" and "CMA heap" config options:

image

Maybe it is worth enabling them and give it a shot? I know that have been enabled on libreelec for other rockchip platforms to let the video decoding work.

I enabled CONFIG_DMABUF_HEAPS_SYSTEM=y and CONFIG_DMABUF_HEAPS_CMA=y, still only kernel cmdline cma=256M works with rkvdec2. Setting cma 256M in dts won't work.

@amazingfate amazingfate force-pushed the rkvdec2-rk356x branch 3 times, most recently from ff15ff6 to b072b85 Compare July 27, 2024 08:02
@amazingfate
Copy link
Contributor Author

Update patches to 6.10, and add config/bootenv/rk356x.txtto let rk356x boards boot with kernel arg cma=256M.

@amazingfate
Copy link
Contributor Author

Tested with rock-3a, chromium can decode 4k 2160p h264 video with vpu.

@amazingfate amazingfate removed the Work in progress Unfinished / work in progress label Jul 28, 2024
@amazingfate amazingfate merged commit 94ef30d into armbian:main Jul 28, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BSP Board Support Packages Framework Framework components Hardware Hardware related like kernel, U-Boot, ... Patches Patches related to kernel, U-Boot, ... size/medium PR with more then 50 and less then 250 lines
Development

Successfully merging this pull request may close these issues.

6 participants