The symptoms:
* BUG: unable to handle kernel paging request at 7f45402d
* invalid opcode: 0000 [#1] SMP
* general protection fault: 0000 [#1] SMP
* BUG: unable to handle kernel NULL pointer dereference at 0000000000000258
Then you probably have a memory corruption somewhere.
In my case, I didn't get these errors at first. I was adding objects to a list by calling list_add_tail(my_list, my_obj) and verifying that list_empty(my_list) was returning false as expected (and it was). But latter in the code when I called list_empty(my_list) again, it was returning true, and in nowhere in my code I was removing objects from the list.
Weird behaviors that doesn't make sense probably are due to memory corruption. When I started simplifying the code to isolate the problem, I start getting the errors above in the dmesg log.
Where it is happening:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
[ 37.183001] BUG: unable to handle kernel NULL pointer dereference at 0000000000000258 | |
[ 37.183009] IP: [<ffffffff817c69f0>] _raw_spin_lock_irqsave+0x20/0x80 | |
[ 37.183019] PGD 0 | |
[ 37.183021] Oops: 0002 [#1] SMP | |
[ 37.183023] Modules linked in: vmc(O-) videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common videodev media vmw_vsock_vmci_transport vsock vmwgfx ttm drm_kms_helper snd_ens1371 snd_ac97_codec ac97_bus gameport snd_pcm hid_generic drm snd_seq_midi snd_seq_midi_event snd_rawmidi btusb rfcomm usbhid bnep snd_seq btbcm hid btintel bluetooth snd_seq_device snd_timer snd vmw_balloon serio_raw vmw_vmci soundcore coretemp ppdev shpchp lp i2c_piix4 parport_pc parport mac_hid psmouse e1000 ahci libahci mptspi mptscsih mptbase scsi_transport_spi floppy vmw_pvscsi vmxnet3 pata_acpi | |
[ 37.183055] CPU: 0 PID: 2142 Comm: modprobe Tainted: G O 4.1.0-rc3+ #22 | |
[ 37.183057] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014 | |
[ 37.183058] task: ffff88002d3fef60 ti: ffff880075998000 task.ti: ffff880075998000 | |
[ 37.183060] RIP: 0010:[<ffffffff817c69f0>] [<ffffffff817c69f0>] _raw_spin_lock_irqsave+0x20/0x80 | |
[ 37.183063] RSP: 0018:ffff88007599bd40 EFLAGS: 00010092 | |
[ 37.183064] RAX: 0000000000000292 RBX: 0000000000000292 RCX: ffff88002d551818 | |
[ 37.183064] RDX: 0000000000020000 RSI: ffffffff81500b20 RDI: 0000000000000258 | |
[ 37.183065] RBP: ffff88007599bd48 R08: 000000000001a620 R09: ffff88007b61a620 | |
[ 37.183066] R10: ffffea0001e355c0 R11: ffffffff813a7e55 R12: 0000000000000258 | |
[ 37.183067] R13: ffff880076097018 R14: ffffffff81500b20 R15: ffffffff81500b30 | |
[ 37.183069] FS: 00007f1da3d54740(0000) GS:ffff88007b600000(0000) knlGS:0000000000000000 | |
[ 37.183070] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b | |
[ 37.183071] CR2: 0000000000000258 CR3: 00000000787f5000 CR4: 00000000000007f0 | |
[ 37.183123] Stack: | |
[ 37.183125] 0000000000000000 ffff88007599bd98 ffffffff81500d52 ffff88007599bd78 | |
[ 37.183127] ffff88002d551818 ffff880076097018 ffff88002d551818 0000000000000001 | |
[ 37.183128] ffff880076097018 00005590e6618410 00007ffe79be03d8 ffff88007599bda8 | |
[ 37.183130] Call Trace: | |
[ 37.183136] [<ffffffff81500d52>] devres_remove+0x32/0xb0 | |
[ 37.183138] [<ffffffff8150157e>] devres_destroy+0xe/0x30 | |
[ 37.183140] [<ffffffff8150165f>] devm_kfree+0x1f/0x40 | |
[ 37.183143] [<ffffffffc03ca6b0>] vmc_cap_destroy+0x40/0x50 [vmc] | |
[ 37.183145] [<ffffffffc03ca06f>] vmc_device_unregister+0x5f/0x90 [vmc] | |
[ 37.183147] [<ffffffffc03ca0b5>] vmc_remove+0x15/0x20 [vmc] | |
[ 37.183148] [<ffffffff814ff6cd>] platform_drv_remove+0x1d/0x40 | |
[ 37.183151] [<ffffffff814fcea7>] __device_release_driver+0x87/0x120 | |
[ 37.183153] [<ffffffff814fd948>] driver_detach+0xc8/0xd0 | |
[ 37.183155] [<ffffffff814fcb29>] bus_remove_driver+0x59/0xe0 | |
[ 37.183157] [<ffffffff814fe120>] driver_unregister+0x30/0x70 | |
[ 37.183158] [<ffffffff814ff7a2>] platform_driver_unregister+0x12/0x20 | |
[ 37.183160] [<ffffffffc03d0b9c>] vmc_exit+0x10/0x474 [vmc] | |
[ 37.183165] [<ffffffff810fde0c>] SyS_delete_module+0x1ac/0x230 | |
[ 37.183170] [<ffffffff81094f8c>] ? task_work_run+0xbc/0xf0 | |
[ 37.183173] [<ffffffff817c6ff2>] system_call_fastpath+0x16/0x75 | |
[ 37.183174] Code: 97 c8 8f ff 5d c3 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 53 9c 58 66 66 90 66 90 48 89 c3 fa 66 66 90 66 66 90 ba 00 00 02 00 <f0> 0f c1 17 89 d1 c1 e9 10 66 39 d1 75 06 48 89 d8 5b 5d c3 41 | |
[ 37.183190] RIP [<ffffffff817c69f0>] _raw_spin_lock_irqsave+0x20/0x80 | |
[ 37.183192] RSP <ffff88007599bd40> | |
[ 37.183193] CR2: 0000000000000258 |
1) The IP (instruction pointer) address in the 2nd line
2) The call trace at line 23rd
The IP
In the 2nd line of this log, we have the instruction pointer address, it is the address the computer was executing that generated the error.In this example, the IP is ffffffff817c69f0, then we can find where in the code this address corresponds to using the addr2line tool:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
$ addr2line -e path_to_your_kernel_tree/vmlinux 0xffffffff817c69f0 | |
path_to_your_kernel_tree//./arch/x86/include/asm/spinlock.h:106 |
The vmlinux file is the uncompressed version of the Linux image (it is created when you do make)
Note: if it doesn't work, check if your make menuconfig if your kernel is compiled with DEBUG flag
But this technique won't work if the corresponding address is not part of the kernel core code (if it was an error caused by some module, as we will see in the call trace section).
The call trace
In the log above, line 23rd, it means that the function devres_remove was called by devres_destroy, which was called by devm_kfree, which was called by vmc_cap_destroy and so on.Now suppose that the vmc_cap_destroy function call devm_kfree in many places and you want to know exactly which one has triggered the bug. Lets do this in 2 steps:
1) Get the offset in the code section of the function using nm:
nm is a tool to get the offset of a symbol in a section:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
$ nm drivers/media/vmc/vmc.ko | grep vmc_cap_destroy | |
0000000000000680 T vmc_cap_destroy |
2) Get the file and the line using addr2line:
In line 27th we have: vmc_cap_destroy+0x40, where the 0x40 is the offset inside the vmc_cap_destroy function where the code execution would return if the call to devm_kfree hadn't triggered the bug.
So lets add the function offset we found in the previous step with the offset inside the function: 0x680 + 0x40 = 0x6c0
And now we use the addr2line with the compile module.ko to find out where is it:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
$ addr2line -e path_to_your_kernel_tree/drivers/media/vmc/vmc.ko 0x6c0 | |
path_to_your_kernel_tree/drivers/media/vmc/vmc-capture.c:396 |
Enable debug prints:
sudo sh -c "echo 8 > /proc/sys/kernel/printk"
If this doesn't work, check if the DYNAMIC_DEBUG flag is enabled in your menuconfig, if so, then check the next section about Dynamic debug.
In the case of a module that I was testing (the vivid module), I needed to change the vivid_debug parameter:
We can do this when we start the module:
sudo modprobe vivid vivid_debug=8
Or changing the parameter while its already running:
sudo sh -c "echo -n 0 > sys/module/vivid/parameters/vivid_debug"
Dynamic debug:
If your kernel is compiled with DYNAMIC_DEBUG flag, then changing the printk level probably won't enable the debug prints in the dmesg log.Lets say the module we are working is called media, lets add it into the kernel:
sudo modprobe media
And now we will enable all the debug prints in the media module by sending 'module media +p' to the dynamic_debug/control file:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
$ sudo sh -c "echo -n 'module media +p' > /sys/kernel/debug/dynamic_debug/control" | |
$ sudo cat /sys/kernel/debug/dynamic_debug/control | grep "\[media\]" | |
drivers/media/media-entity.c:301 [media]media_entity_pipeline_start =p "\042%s\042:%u must be connected by an enabled link\012" | |
drivers/media/media-entity.c:287 [media]media_entity_pipeline_start =p "link validation failed for \042%s\042:%u -> \042%s\042:%u, error %d\012" |
To disable the prints, we send 'module media -p' to the dynamic_debug/control file:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
$ sudo sh -c "echo -n 'module media -p' > /sys/kernel/debug/dynamic_debug/control" | |
$ sudo cat /sys/kernel/debug/dynamic_debug/control | grep "\[media\]" | |
drivers/media/media-entity.c:301 [media]media_entity_pipeline_start =_ "\042%s\042:%u must be connected by an enabled link\012" | |
drivers/media/media-entity.c:287 [media]media_entity_pipeline_start =_ "link validation failed for \042%s\042:%u -> \042%s\042:%u, error %d\012" |
Enabling a specific print
Instead of enabling all the debug prints of the entire module, we can enable all the prints on a specific file, or we can specify a file and a line:
$ sudo sh -c "echo -n 'file media-entity.c +p' > /sys/kernel/debug/dynamic_debug/control"
$ sudo sh -c "echo -n 'file media-entity.c line 301 +p' > /sys/kernel/debug/dynamic_debug/control"
Read dmesg log continuously
Instead of calling dmesg every time to check the log, you can do:
$ tail -f /var/log/kern.log
and the log will be printed continuously.
and the log will be printed continuously.
Thanks Helen for a very informative article! I've used a couple of the tools named but I'm really happy that I got to know new debugging techniques and tools too! I have two questions though:
ReplyDeleteWhy did you decide that the problem is memory corruption?
You introduced a problem a problem that occurred to you and then provided the techniques that you followed to handle the issue. What happened to your problem at the end? I know that you solved it but I would like to know details. =D
Thanks for asking :)
DeleteI decided that my problem was about memory corruption because I had a list with elements, and this list was empty in some point in the code, but I never removed the elements from it.
I was working with a complex topology (described in this post http://helenfornazier.blogspot.com.br/2015/06/outreachy-status-log-third-week-basic.html), then I tried to simplify it a bit, using only the "Sensor A" to generate the image and the "Capture A" to export this image to user space (without the other nodes as debayers, sensor B or capture B).
When I simplified it, I got a kernel message error (kernel Oops) saying I was accessing some region of the code that I shouldn't.
I wasn't getting this error with the complex topology because of a coincidence (explained below). Then I find out the line of the code which was generating the error and I realize I was accessing an array like pipe[ent_num] instead of pipe[ent_num-1].
The coincidence:
In the complex topology, when the "Capture A" (/dev/video0) is opened to read a frame, it activates the "Sensor A" to generate the frames periodically.
The order of the entities in the pipe array was something like: Sensor A, Sensor B, Debayer A, Debayer B, Capture A, Capture B, Scaler.
When the "Sensor A" generates a frame, it is sent to the "Capture A" through the pipe array and using the index ent_num, but as the minus 1 was missing, the frame was being sent to the "Capture B" instead of "Capture A", and I wasn't getting the error because the "Capture B" has the same code of "Capture A".
CONFIG_SLUB_DEBUG_ON=y can help in some instances of memory corruption.
ReplyDeleteThis is so amazing post very interesting.Keep sharing dude.Well done.
ReplyDeleteDownload Latest Version HTTP Debugger Pro Crack
Thanks for Sharing such an useful & informative stuff...
ReplyDeletemule training
ReplyDeleteThank you for sharing such a great information.Its really nice and informative.hope more posts from you. I also want to share some information recently i have gone through and i had find the one of the best mulesoft tutorial videos
mmorpg oyunlar
ReplyDeleteinstagram takipçi satin al
Tiktok Jeton Hilesi
tiktok jeton hilesi
antalya saç ekimi
referans kimliği nedir
İNSTAGRAM TAKİPÇİ SATIN AL
Mt2 pvp serverler
instagram takipçi satın al
tül perde modelleri
ReplyDeletesms onay
mobil ödeme bozdurma
NFT NASIL ALİNİR
ANKARA EVDEN EVE NAKLİYAT
trafik sigortası
dedektor
Kurma Website
Aşk romanları
Smm Panel
ReplyDeleteSmm Panel
İS İLANLARİ
İnstagram Takipçi Satın Al
hirdavatciburada.com
beyazesyateknikservisi.com.tr
servis
Tiktok Jeton Hilesi
Good content. You write beautiful things.
ReplyDeletehacklink
hacklink
vbet
sportsbet
mrbahis
taksi
vbet
mrbahis
korsan taksi
hatay
ReplyDeletekars
mardin
samsun
urfa
D7U6Q
bitlis
ReplyDeleteedirne
hatay
ağrı
urfa
ZZ8O
https://saglamproxy.com
ReplyDeletemetin2 proxy
proxy satın al
knight online proxy
mobil proxy satın al
ULGH
fbvvbnvnbvnmbvmbmn
ReplyDeleteصيانة افران الغاز بجدة
fghfghghjghjgjkhkj
ReplyDeleteصيانة افران الغاز بجدة
شركة تنظيف مجالس بالجبيل
ReplyDeleteU1EOT21TeSR