Trying to understand the different selfhosted monitoring solutions

dr_robot@kbin.social · 3 years ago

Capillary7379@lemmy.world · 3 years ago

I opted for checkmk as well and don’t want to switch. It’s got a good default for Linux monitoring and it will tell me about random things to fix after reboots, or that memory/disc is getting low so I can fix it quickly.

When monitoring 15 virtual machines on one physical host, the default of checking every minute for all machines raised the temperature on the physical machine to over 80 degrees Celsius and triggered a warning. Checking every five minutes is more than I need, so I went with that change.
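For a rough sense of what the interval change means, here is a minimal sketch of the arithmetic. It assumes each monitored host is polled once per check interval, which is a simplification and not something measured in this thread; `polls_per_hour` is a hypothetical helper name.

```python
def polls_per_hour(hosts: int, interval_seconds: int) -> float:
    """Host polls per hour, assuming one poll per host per interval."""
    return hosts * 3600 / interval_seconds

HOSTS = 15  # the number of VMs mentioned above

every_minute = polls_per_hour(HOSTS, 60)
every_five_minutes = polls_per_hour(HOSTS, 300)

print(f"60 s interval:  {every_minute:.0f} polls/hour")
print(f"300 s interval: {every_five_minutes:.0f} polls/hour")
print(f"reduction factor: {every_minute / every_five_minutes:.0f}x")
```

Under that assumption, the five-minute interval cuts the polling work to a fifth, whatever the cost of a single poll turns out to be.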

Dran@lemmy.world · 3 years ago

I have a little 4-core/8 GB RAM VM running my work instance that monitors over a thousand clients on 60s check intervals, so you may want to look into your config. I honestly have no idea what could cause 15 machines to cost that much computationally.

Capillary7379@lemmy.world · 3 years ago

Sorry, 70 degrees, not 80. The load was fine. It’s a machine to test things, but I kept using checkmk since I really liked it. Everything is on one server, both the monitoring server and all clients. It’s an old workstation, and it runs at around 60 degrees normally.

That said, it could very much be a config issue, I installed with the ansible role and left most everything as default. A very easy installation, and with ansible very easy to add new hosts to monitor as well. I’m up to 36 now, including some docker containers.

I switched back to 1 minute to test, and it warned about the temperature within 20 minutes, going from 60 degrees to hovering around 70. Load went from 2 to 3.5 and threads from 1k to 1.2k, all on the physical side. There’s also a small change in IO that seems to be the checkmk server writing more to disk. The CPU on that host is only slightly higher.
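To put those readings in relative terms, here is a small sketch that computes the percentage change for the load and thread figures quoted above. The numbers are the approximate ones from this comment, and `percent_change` is a hypothetical helper.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100

# Approximate readings from the 5-minute vs 1-minute comparison above.
readings = {
    "load": (2.0, 3.5),
    "threads": (1000, 1200),
}

for name, (before, after) in readings.items():
    change = percent_change(before, after)
    print(f"{name}: {before} -> {after} ({change:+.0f}%)")
```

The temperature is left out on purpose: a percentage of a Celsius reading is not very meaningful, so the 10-degree absolute rise is the more useful figure there.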

I’m guessing that the temp going over is hardware related, a better fan might fix that issue.

I don’t know if the load/thread increase is reasonable, but given the amount of checks done in the agent I’m perfectly OK with giving those resources to have all the data points checkmk collects available. It’s helped a lot being able to go into details to see what’s going on, checkmk makes that so easy.

3 years ago

That’s odd. I’m currently monitoring 17 vms on one host along with a handful of physical devices. Nothing like the issues you’ve encountered has happened.