
SER5 5700U crashes (CPU?)


Newbie | Post time 2024-10-30 04:57:30
Edited by ZTHawk at 2024-10-30 05:22

Hello, I have a SER5 5700U with 32GB RAM (2 * 16GB). BIOS is v509.
For some time now I have been having crash/freeze issues.

I am running Proxmox (8.2.7) with ~6 VMs. The CPU idles at ~6% and rarely goes above 12%.

Sometimes the machine runs for ~3 days, sometimes it crashes after ~3 hours.
I also tried with only one RAM stick (tested both) with the same result.



  22:03:16 kernel: kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
  22:03:16 kernel: BUG: unable to handle page fault for address: ffff9b36049dd180
  22:03:16 kernel: #PF: supervisor instruction fetch in kernel mode

  23:29:21 kernel: mce: [Hardware Error]: Machine check events logged
  23:29:21 kernel: [Hardware Error]: Corrected error, no action required.
  23:29:21 kernel: [Hardware Error]: CPU:2 (17:68:1) MC1_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0x9c20000000010859
  23:29:21 kernel: [Hardware Error]: Error Addr: 0x000000045a394cc0
  23:29:21 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000005a020300
  23:29:21 kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 1
  23:29:21 kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)

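For anyone trying to interpret those lines: the flags the kernel prints (CE, MiscV, AddrV, SyndV) are just bits of the raw MC1_STATUS value. A minimal decoder sketch in Python (bit positions per AMD's published MCA_STATUS layout; `decode_mca_status` is my own illustrative helper, not mcelog or rasdaemon):

```python
# Decode the top-level AMD MCA_STATUS bits of a machine-check status value.
# Bit positions follow AMD's documented layout: Val=63, Over=62, UC=61,
# En=60, MiscV=59, AddrV=58, SyndV=53; extended error code in bits 16-21.
def decode_mca_status(status: int) -> dict:
    bit = lambda n: bool((status >> n) & 1)
    return {
        "valid": bit(63),        # Val: the logged information is valid
        "overflow": bit(62),     # Over: an earlier error was overwritten
        "uncorrected": bit(61),  # UC: clear means corrected error (CE)
        "enabled": bit(60),      # En: reporting was enabled for this error
        "misc_valid": bit(59),   # MiscV: MCA_MISC holds valid data
        "addr_valid": bit(58),   # AddrV: the "Error Addr" line is valid
        "synd_valid": bit(53),   # SyndV: the syndrome value is valid
        "ext_error_code": (status >> 16) & 0x3F,
    }

# The MC1_STATUS value from the log above:
print(decode_mca_status(0x9C20000000010859))
```

On the value above this reports a valid, corrected (UC clear) error with MiscV/AddrV/SyndV set and extended error code 1, matching the kernel's own decode line.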
  11:00:25 systemd[1]: Finished pve-guests.service - PVE guests.
  11:00:25 systemd[1]: Starting pvescheduler.service - Proxmox VE scheduler...
  11:00:26 kernel: BUG: unable to handle page fault for address: 0000000006fff453
  11:00:26 kernel: #PF: supervisor read access in kernel mode
  11:00:26 kernel: #PF: error_code(0x0000) - not-present page
  11:00:26 kernel: PGD 0 P4D 0
  11:00:26 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
  11:00:26 kernel: CPU: 2 PID: 2238 Comm: CPU 0/KVM Tainted: P O 6.8.12-2-pve #1
  11:00:26 kernel: Hardware name: AZW SER/SER, BIOS SER5H509 01/24/2024
  11:00:26 kernel: RIP: 0010:svm_handle_exit+0x5c/0x200 [kvm_amd]
  11:00:26 kernel: Code: 44 8b 70 70 41 80 fc 01 0f 87 d4 f2 00 00 41 83 e4 01 75 77 48 8b 83 d8 19 00 00 48 8b 10 f7 c2 00 00 01 00 0f 84 16 01 00 00 <44> 0f b6 25 9c 3a 01 00 41 80 fc 01 0f 87 6c f2 00 00 41 83 e4 01
  11:00:26 kernel: RSP: 0018:ffffb5cb54c13c00 EFLAGS: 00010246
  11:00:26 kernel: RAX: 0000000080050033 RBX: ffff8ad9e2953900 RCX: ffff8ad9e43d2000
  11:00:26 kernel: RDX: 00ff00ff00100010 RSI: 0000000000000000 RDI: ffff8ad9e2953900
  11:00:26 kernel: RBP: ffffb5cb54c13c30 R08: 000000281bfc5528 R09: 0000000000000000
  11:00:26 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  11:00:26 kernel: R13: ffffb5cb54a6d000 R14: 0000000000000060 R15: 0000000000000000
  11:00:26 kernel: FS: 000073101da006c0(0000) GS:ffff8ae010b00000(0000) knlGS:0000000000000000
  11:00:26 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  11:00:26 kernel: CR2: 0000000006fff453 CR3: 000000011024c000 CR4: 0000000000350ef0
  11:00:26 kernel: Call Trace:
  11:00:26 kernel: <TASK>
  11:00:26 kernel: ? show_regs+0x6d/0x80
  11:00:26 kernel: ? __die+0x24/0x80
  11:00:26 kernel: ? page_fault_oops+0x176/0x500
  11:00:26 kernel: ? do_user_addr_fault+0x2ed/0x660
  11:00:26 kernel: ? exc_page_fault+0x83/0x1b0
  11:00:26 kernel: ? asm_exc_page_fault+0x27/0x30
  11:00:26 kernel: ? svm_handle_exit+0x5c/0x200 [kvm_amd]
  11:00:26 kernel: kvm_arch_vcpu_ioctl_run+0xd5b/0x1760 [kvm]
  11:00:26 kernel: ? do_futex+0x128/0x230
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: kvm_vcpu_ioctl+0x297/0x800 [kvm]
  11:00:26 kernel: ? do_syscall_64+0x8d/0x170
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: ? do_syscall_64+0x8d/0x170
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: ? kvm_on_user_return+0x78/0xd0 [kvm]
  11:00:26 kernel: __x64_sys_ioctl+0xa3/0xf0
  11:00:26 kernel: x64_sys_call+0xa68/0x24b0
  11:00:26 kernel: do_syscall_64+0x81/0x170
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: ? kvm_vcpu_ioctl+0x30e/0x800 [kvm]
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: ? syscall_exit_to_user_mode+0x89/0x260
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: ? do_syscall_64+0x8d/0x170
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: ? syscall_exit_to_user_mode+0x89/0x260
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: ? kvm_on_user_return+0x78/0xd0 [kvm]
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: ? fire_user_return_notifiers+0x3a/0x80
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: ? syscall_exit_to_user_mode+0x89/0x260
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: ? do_syscall_64+0x8d/0x170
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: ? do_syscall_64+0x8d/0x170
  11:00:26 kernel: ? irqentry_exit+0x43/0x50
  11:00:26 kernel: ? srso_return_thunk+0x5/0x5f
  11:00:26 kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80
  11:00:26 kernel: RIP: 0033:0x731021d79c5b
  11:00:26 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
  11:00:26 kernel: RSP: 002b:000073101d9faee0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  11:00:26 kernel: RAX: ffffffffffffffda RBX: 000064d7778cb840 RCX: 0000731021d79c5b
  11:00:26 kernel: RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000022
  11:00:26 kernel: RBP: 000000000000ae80 R08: 0000000000000000 R09: 0000000000000000
  11:00:26 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  11:00:26 kernel: R13: 0000000000000006 R14: 0000000000000060 R15: 0000000000000000
  11:00:26 kernel: </TASK>
  11:00:26 kernel: Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables bonding tls softdog sunrpc nfnetlink_log nfnetlink binfmt_misc intel_rapl_msr intel_rapl_common snd_acp_legacy_mach snd_acp_mach snd_soc_nau8821 snd_soc_dmic snd_acp3x_pdm_dma snd_acp3x_rn snd_sof_amd_acp63 edac_mce_amd snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci iwlmvm snd_sof_xtensa_dsp kvm_amd snd_sof mac80211 snd_hda_codec_realtek snd_sof_utils amdgpu kvm snd_soc_core snd_hda_codec_generic libarc4 snd_hda_codec_hdmi snd_compress btusb ac97_bus btrtl amdxcp snd_pcm_dmaengine drm_exec irqbypass snd_hda_intel btintel snd_pci_ps gpu_sched crct10dif_pclmul snd_intel_dspcfg btbcm snd_rpl_pci_acp6x polyval_clmulni snd_intel_sdw_acpi drm_buddy snd_acp_pci btmtk drm_suballoc_helper polyval_generic snd_hda_codec drm_ttm_helper ghash_clmulni_intel snd_acp_legacy_common snd_hda_core ttm sha256_ssse3 iwlwifi snd_pci_acp6x bluetooth
  11:00:26 kernel: sha1_ssse3 snd_hwdep aesni_intel snd_pci_acp5x drm_display_helper snd_pcm crypto_simd ecdh_generic cryptd cec snd_timer cdc_acm ecc snd_rn_pci_acp3x snd_acp_config cfg80211 rc_core snd_soc_acpi snd wmi_bmof cm32181 rapl i2c_algo_bit soundcore snd_pci_acp3x serio_raw ccp pcspkr k10temp industrialio mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c xhci_pci nvme xhci_pci_renesas crc32_pclmul psmouse ahci nvme_core xhci_hcd r8169 libahci i2c_piix4 nvme_auth video realtek wmi i2c_hid_acpi i2c_hid hid
  11:00:26 kernel: CR2: 0000000006fff453
  11:00:26 kernel: ---[ end trace 0000000000000000 ]---
  11:00:26 kernel: RIP: 0010:svm_handle_exit+0x5c/0x200 [kvm_amd]
  11:00:26 kernel: Code: 44 8b 70 70 41 80 fc 01 0f 87 d4 f2 00 00 41 83 e4 01 75 77 48 8b 83 d8 19 00 00 48 8b 10 f7 c2 00 00 01 00 0f 84 16 01 00 00 <44> 0f b6 25 9c 3a 01 00 41 80 fc 01 0f 87 6c f2 00 00 41 83 e4 01
  11:00:26 kernel: RSP: 0018:ffffb5cb54c13c00 EFLAGS: 00010246
  11:00:26 kernel: RAX: 0000000080050033 RBX: ffff8ad9e2953900 RCX: ffff8ad9e43d2000
  11:00:26 kernel: RDX: 00ff00ff00100010 RSI: 0000000000000000 RDI: ffff8ad9e2953900
  11:00:26 kernel: RBP: ffffb5cb54c13c30 R08: 000000281bfc5528 R09: 0000000000000000
  11:00:26 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  11:00:26 kernel: R13: ffffb5cb54a6d000 R14: 0000000000000060 R15: 0000000000000000
  11:00:26 kernel: FS: 000073101da006c0(0000) GS:ffff8ae010b00000(0000) knlGS:0000000000000000
  11:00:26 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  11:00:26 kernel: CR2: 0000000006fff453 CR3: 000000011024c000 CR4: 0000000000350ef0
  11:00:26 kernel: note: CPU 0/KVM[2238] exited with irqs disabled
  11:00:26 pvescheduler[2466]: starting server
  11:00:26 systemd[1]: Started pvescheduler.service - Proxmox VE scheduler.
  11:00:26 systemd[1]: Reached target multi-user.target - Multi-User System.
  11:00:26 systemd[1]: Reached target graphical.target - Graphical Interface.
  11:00:26 systemd[1]: Starting systemd-update-utmp-runlevel.service - Record Runlevel Change in UTMP...
  11:00:26 systemd[1]: systemd-update-utmp-runlevel.service: Deactivated successfully.
  11:00:26 systemd[1]: Finished systemd-update-utmp-runlevel.service - Record Runlevel Change in UTMP.
  11:00:26 systemd[1]: Startup finished in 5.991s (firmware) + 6.316s (loader) + 3.231s (kernel) + 1min 21.266s (userspace) = 1min 36.806s.
  11:00:27 chronyd[1117]: Selected source 80.153.195.191 (2.debian.pool.ntp.org)


In most cases I get either the second or the third of the errors above.

The best way to force it seems to be starting a compilation in some VMs. Those VMs then get segmentation faults (inside the VM), and on the host the above errors occur (either immediately or after a short delay).
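For anyone who wants to reproduce the load pattern without setting up a build job inside a VM, a crude all-core burner is enough to approximate it. A minimal sketch in plain Python (my own stand-in for a real compile workload; stress-ng or an actual kernel build exercises the CPU more broadly):

```python
# Keep every logical CPU busy for a fixed duration by hashing in worker
# processes -- a rough stand-in for "start a compilation in some VMs".
import hashlib
import multiprocessing as mp
import os
import time

def burn(seconds: float) -> int:
    """Hash continuously for `seconds`; return the number of iterations."""
    deadline = time.monotonic() + seconds
    data, iterations = b"x" * 4096, 0
    while time.monotonic() < deadline:
        data = hashlib.sha256(data).digest() * 128  # stay at 4 KiB per round
        iterations += 1
    return iterations

def burn_all_cores(seconds: float) -> list:
    """Run one busy worker per logical CPU; return per-worker iteration counts."""
    workers = os.cpu_count() or 1
    with mp.Pool(workers) as pool:
        return pool.map(burn, [seconds] * workers)

if __name__ == "__main__":
    print(burn_all_cores(60.0))  # ~1 minute of full load on all cores
```

If the fault is load- or heat-related, running this for a few minutes while watching the journal for the mce/oops lines above should narrow it down.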

What could be the root cause?



Moderator | Post time 2024-10-30 16:47:20
Hello there,
1. Power supply: check the power supply to the system. Power fluctuations or insufficient power can cause hardware errors. If possible, try connecting the system to a different power source or using a power-conditioning device.
2. Virtualization configuration: review the VM configurations in Proxmox. Check the memory allocation, CPU assignment, and device passthrough settings for each VM, and make sure they match the requirements of the guest operating systems running inside the VMs.
Also try creating a new, simple VM with minimal resources and see if it exhibits the same problems. This helps determine whether the issue is specific to certain VM configurations or general to the virtualization environment.
3. Software updates: update Proxmox to the latest version along with any associated packages, and check the Proxmox changelogs for known issues related to the errors you are experiencing.

Author | Post time 2024-11-02 22:18:49
1) This is difficult to test; I only have the original PSU.
Not sure if this argues against a PSU issue, but as said, I also had crashes under low utilization (in Proxmox).
Only during boot does CPU usage go above 20%; otherwise it stays at 5-12%.
2) Everything is set up correctly. I had similar (or even worse) issues before moving from a Windows 11 host to the Proxmox host.
On Win11 it was first VirtualBox (only 2 VMs) and later VMware (also 2 VMs).
3) Proxmox is up to date. There are no other logs except those listed above.

Moderator | Post time 2024-11-04 16:16:32
The specific error message indicates that "the kernel tried to execute an NX-protected page," which may be caused by malicious code or memory corruption.

Please try running with only one RAM module installed, testing each module in turn, and see if that works better.

Author | Post time 2024-11-07 17:22:32
I have tested both RAM modules separately (I said so in the first post; maybe not clearly enough). Same behaviour.

Right now the system has been running for almost 9 days with no crash/freeze. There was one error:
  Nov 03 09:48:12 kernel: mce: [Hardware Error]: Machine check events logged
  Nov 03 09:48:12 kernel: [Hardware Error]: Corrected error, no action required.
  Nov 03 09:48:12 kernel: [Hardware Error]: CPU:2 (17:68:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000010859
  Nov 03 09:48:12 kernel: [Hardware Error]: Error Addr: 0x0000000484cbccc0
  Nov 03 09:48:12 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000005a020300
  Nov 03 09:48:12 kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 1
  Nov 03 09:48:12 kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)

I have increased the initial fan power and lowered the temperature at which the fan starts to spin.
Load is still low (max ~12-15%).

Right now it looks like a temperature issue, but I would expect the CPU to throttle in such cases rather than generate errors.


Moderator | Post time 2024-11-08 15:25:32
ZTHawk replied at 2024-11-07 17:22
I have tested both RAM modules separately (said so in the first post; maybe not clear enough). Same  ...

Hi there,
the error was already corrected by the hardware; no action was needed.

Proxmox hardware monitoring tools:
Proxmox supports monitoring hardware status through standard Linux tools. You can install utilities such as lm-sensors (and dmidecode for SMBIOS data) to track CPU temperature, fan status, and other readings in real time, ensuring that resources in the virtualization environment are not overburdened.
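To illustrate: once the k10temp driver is loaded (it appears in the module list above), the readings lm-sensors shows are also exposed under /sys/class/hwmon and can be polled with no extra packages. A minimal sketch (path layout per the standard Linux hwmon sysfs interface; the helper name is mine):

```python
# Poll CPU/board temperature sensors via the standard hwmon sysfs interface.
from pathlib import Path

def read_temps(base: str = "/sys/class/hwmon") -> dict:
    """Return {"<chip>/<label>": degrees_celsius} for every hwmon sensor."""
    temps = {}
    for chip in Path(base).glob("hwmon*"):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in chip.glob("temp*_input"):
            label_file = chip / sensor.name.replace("_input", "_label")
            label = label_file.read_text().strip() if label_file.exists() else sensor.name
            # hwmon reports millidegrees Celsius
            temps[f"{chip_name}/{label}"] = int(sensor.read_text()) / 1000.0
    return temps

if __name__ == "__main__":
    for sensor, celsius in sorted(read_temps().items()):
        print(f"{sensor}: {celsius:.1f} C")  # e.g. k10temp/Tctl on this Ryzen
```

Logging this once a minute alongside the load would show whether the MCEs line up with temperature peaks.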

Ensure that the Proxmox system and drivers are up to date:
Make sure both your Proxmox system and its drivers are on the latest versions. Older versions of Proxmox or its drivers may not fully support new hardware, which can affect performance and hardware temperature control.

Author | Post time 2024-11-12 00:10:45
Edited by ZTHawk at 2024-11-11 22:34

It crashed again with the very long error, again under almost no load.
There was no log in Proxmox; the error was only visible on the monitor (and no keyboard input was possible).

I am running memtest86+ now.

Proxmox is up to date.
1) Is there a link to these "Proxmox hardware monitoring tools"?
2) What do you mean by "the error was corrected already"?
3) Just in case: can you provide me with the latest BIOS?
  99.223330208A0
  SN:A57003AF00269
  SER5 PRO-E-321TBEJ0W64PRO-DP/XB

Update: memtest86+ finished with no errors (4 passes).

Moderator | Post time 2024-11-12 16:28:46
ZTHawk replied at 2024-11-12 00:10
Crashed again with the very long error. Again alsmost no load.
No log in Proxmox. It was only visibl ...

Hi there,
the previous log shows the error was corrected; no action required.

For Proxmox hardware monitoring tools:
1. Prometheus + Grafana
Prometheus: https://prometheus.io/
Grafana: https://grafana.com/
Prometheus exporters (node_exporter): https://github.com/prometheus/node_exporter
Prometheus + Grafana installation and configuration:
https://www.digitalocean.com/com ... ana-on-ubuntu-20-04

2. Nagios
Nagios: https://www.nagios.org/
Nagios plug-ins (hardware monitoring): https://exchange.nagios.org/directory/Plugins
IPMI plug-in: https://exchange.nagios.org/dire ... evices/IPMI/details
Nagios tutorial: https://nagios.sourceforge.io/docs/

For the BIOS:
Please send us a picture of the SN number and BIOS version so we can check whether you have the latest BIOS and send you files accordingly.
Here's how to check your BIOS version:
https://mega.nz/#F!yuISGa4I!s1bQQajKwnsEdzjqq4nopQ

Author | Post time 2024-11-12 18:38:33
Edited by ZTHawk at 2024-11-12 12:02

See my post above: all the info from the back label is there.

But anyway, here is a screenshot:

20241111_191925.jpg

Moderator | Post time 2024-11-13 16:11:51
ZTHawk replied at 2024-11-12 18:38
See my post: there are all the info from the back

But anyway, here is a screenshot

Hello there,
this is the latest BIOS: https://url.bee-link.cn/nktG
and this is the tutorial: https://url.bee-link.cn/19YO
