Alex Dehnert [Mon, 9 Oct 2023 04:32:03 +0000 (04:32 +0000)]
Add new hosts to hostgroups, remove shut down ones
Alex Dehnert [Mon, 9 Oct 2023 04:30:58 +0000 (04:30 +0000)]
masada: Start monitoring it
Long past time -- not sure why I missed it. Really we should make sure the KDC
is operational, but pingable is a start.
Alex Dehnert [Mon, 9 Oct 2023 04:30:20 +0000 (04:30 +0000)]
olinda: olinda is shut down, so disable most notifications
Alex Dehnert [Mon, 9 Oct 2023 04:27:48 +0000 (04:27 +0000)]
bots: Tweak what's monitored
I reinstalled bots, and it prompted me to look closer at monitoring of it. Some
old services no longer work (in some cases I think because Hangouts is more or
less shut down, in others perhaps the migration didn't work well), so disable
notifications of those. I've also added some new services, so monitor them.
Alex Dehnert [Mon, 9 Oct 2023 04:26:33 +0000 (04:26 +0000)]
Prefer hostnames over IPs
I feel like I flip-flop on this every couple years, and I'm not entirely sure
what prompted this one (maybe the question of how to reach bots?), but probably
I had a reason.
Alex Dehnert [Mon, 9 Oct 2023 04:33:30 +0000 (04:33 +0000)]
Add monitoring of new hosts
Monitoring of augsburg is very basic, but chankillo is more reasonable (and
heavily based on olinda).
Alex Dehnert [Mon, 9 Oct 2023 04:23:57 +0000 (04:23 +0000)]
Allow some services to have public notifications and others private
Alex Dehnert [Tue, 12 Sep 2023 18:24:00 +0000 (18:24 +0000)]
Replication is being checked locally, so apply to localhost
Otherwise standing up a monitoring service on chankillo led to misleading
outage messages claiming that olinda was broken. Citing "localhost" leaves you
to figure out which host it is, but at least it's obviously unclear.
Alex Dehnert [Sat, 5 Aug 2023 20:24:56 +0000 (20:24 +0000)]
Switch to instanced personals, not classed, for notifications
Zulip doesn't support classed personals, and these days I mostly read on Zulip.
Alex Dehnert [Sat, 15 Jul 2023 02:29:55 +0000 (02:29 +0000)]
Ignore (empty) htdigest.users for now
Alex Dehnert [Sun, 9 Jul 2023 05:05:03 +0000 (05:05 +0000)]
Merge remote-tracking branch 'origin/master' into nagios4
Alex Dehnert [Sun, 9 Jul 2023 05:03:42 +0000 (05:03 +0000)]
nagios4: Update Apache config
Alex Dehnert [Sun, 11 Jun 2023 19:06:51 +0000 (15:06 -0400)]
Add monitoring of backups
Alex Dehnert [Sat, 27 May 2023 18:33:20 +0000 (18:33 +0000)]
Add new stylesheets
Alex Dehnert [Sat, 27 May 2023 06:35:52 +0000 (06:35 +0000)]
Use dehnerts.com hostnames, not mit.edu
We trust the SSH CA for dehnerts.com, not mit.edu, so this avoids "Host key
verification failed" errors.
I think the motivation not to do this was DNS downtime, but hopefully we
can solve that with redundant DNS.
Alex Dehnert [Sat, 27 May 2023 06:35:33 +0000 (06:35 +0000)]
More nagios4 updates
Alex Dehnert [Fri, 26 May 2023 07:58:11 +0000 (07:58 +0000)]
nagios4: Add new objects config, remove old extinfo
Alex Dehnert [Mon, 4 Oct 2021 15:19:42 +0000 (11:19 -0400)]
salt: Update Vault check to run elsewhere
Also, make it run less frequently (we're looking for expiry more than downtime)
and have a higher timeout (it seems to frequently take more than ten seconds).
Alex Dehnert [Sun, 3 Oct 2021 02:41:34 +0000 (22:41 -0400)]
wieliczka: Add a check that Salt can talk to Vault
Alex Dehnert [Mon, 27 Sep 2021 02:59:44 +0000 (22:59 -0400)]
roost-api: Add zephyr/zulip bridge monitoring
Alex Dehnert [Mon, 27 Sep 2021 02:59:27 +0000 (22:59 -0400)]
xidi: Add xidi (adehnert-pi4) monitoring
Alex Dehnert [Sun, 11 Jul 2021 21:40:38 +0000 (17:40 -0400)]
vault: Add a check for seal status
Alex Dehnert [Fri, 9 Jul 2021 00:23:49 +0000 (20:23 -0400)]
vault: Check that the vault server is responding with good cert
Alex Dehnert [Thu, 29 Apr 2021 00:21:34 +0000 (20:21 -0400)]
roost-api: Add check for HTTPS service
Alex Dehnert [Sat, 17 Apr 2021 07:46:00 +0000 (03:46 -0400)]
Send personal zephyrs, now that there's more content
Alex Dehnert [Sat, 17 Apr 2021 04:57:49 +0000 (00:57 -0400)]
Send long output by zephyrs
Alex Dehnert [Sat, 17 Apr 2021 04:56:02 +0000 (00:56 -0400)]
Check for ssh signing less often
We only try to re-sign every three days, so checking every two minutes is
really pretty excessive.
Alex Dehnert [Sun, 4 Apr 2021 04:03:14 +0000 (00:03 -0400)]
Add some more checks on salt minions
Among other things, this fixes sysconfig/salt#16.
Alex Dehnert [Sun, 28 Mar 2021 16:44:02 +0000 (12:44 -0400)]
Configure monitoring of salt minions
Alex Dehnert [Mon, 26 Aug 2019 06:37:36 +0000 (02:37 -0400)]
Add monitoring of dovecot replication
Alex Dehnert [Thu, 27 Jun 2019 05:31:49 +0000 (01:31 -0400)]
ESP has their own monitoring now
Alex Dehnert [Thu, 27 Jun 2019 05:27:32 +0000 (01:27 -0400)]
Switch to new IPs
Alex Dehnert [Sun, 5 May 2019 09:21:59 +0000 (05:21 -0400)]
New Apache/nagios config for xenial (16.04)
Alex Dehnert [Sun, 5 May 2019 03:14:10 +0000 (23:14 -0400)]
nagios config tweaks from upgrading to 16.04
Alex Dehnert [Sat, 20 May 2017 19:07:35 +0000 (15:07 -0400)]
New post-renumbering IP addrs at ET
Alex Dehnert [Sat, 21 Jan 2017 18:47:40 +0000 (13:47 -0500)]
Use novgorod's post-renumbering IP
Alex Dehnert [Tue, 19 Jan 2016 05:48:12 +0000 (00:48 -0500)]
Fix nagios checks
- HTTPS checks should use check_https_hostname, so that we use the hostname's
vhost, not the IP address
- explicitly use .my.cnf for olinda's mysql check, rather than just setting the
home dir (I don't know why that seems to have broken, but it has)
Alex Dehnert [Tue, 19 Jan 2016 01:40:33 +0000 (20:40 -0500)]
Linerva has been dead for ages, so delete the config
Alex Dehnert [Tue, 19 Jan 2016 01:34:25 +0000 (20:34 -0500)]
New config options with Ubuntu 14.04's nagios
Alex Dehnert [Tue, 19 Jan 2016 01:33:50 +0000 (20:33 -0500)]
New stylesheets with Ubuntu 14.04's nagios
Alex Dehnert [Tue, 13 Oct 2015 03:55:09 +0000 (23:55 -0400)]
ESP: add RAID check
Alex Dehnert [Sat, 15 Feb 2014 20:31:29 +0000 (15:31 -0500)]
Set up repeat notifications for my outages
Alex Dehnert [Sat, 15 Feb 2014 18:49:24 +0000 (13:49 -0500)]
Lunatique is no more
Alex Dehnert [Wed, 31 Jul 2013 06:26:18 +0000 (02:26 -0400)]
Check the jabber server
Alex Dehnert [Sat, 6 Jul 2013 18:37:04 +0000 (14:37 -0400)]
Warn on olinda cert expiry only 10 days early
StartSSL doesn't want to renew my cert 14 days early, apparently. :(
Alex Dehnert [Wed, 10 Apr 2013 17:49:41 +0000 (13:49 -0400)]
Bump the large queue threshold
I'm tuning out the 20/40 triggers, which means that I might as well not have
them.
Alex Dehnert [Thu, 28 Mar 2013 03:07:20 +0000 (23:07 -0400)]
Disable notifications on s-b
Ops has switched to a new set of dialups, so s-b will be down for a longish
while yet.
Alex Dehnert [Mon, 26 Nov 2012 07:44:12 +0000 (02:44 -0500)]
Bump the olinda mailq thresholds
Alex Dehnert [Wed, 14 Nov 2012 10:22:18 +0000 (05:22 -0500)]
CGI-related changes?
I think this may be the result of some package upgrade
Alex Dehnert [Wed, 14 Nov 2012 10:21:57 +0000 (05:21 -0500)]
Update ESP config for web access
Alex Dehnert [Sun, 26 Aug 2012 21:02:14 +0000 (17:02 -0400)]
Upstream updates (Lucid->Precise upgrade)
Alex Dehnert [Sat, 28 Jul 2012 19:58:50 +0000 (15:58 -0400)]
Push esp.mit.edu's cert expiry warning to 30 days
This uses "fake" default arguments for check commands. See
http://tracker.nagios.org/print_bug_page.php?bug_id=174 for the bug asking for
a good way to do them, and a workaround for how to do them with current nagios.
Alex Dehnert [Sat, 28 Jul 2012 19:41:08 +0000 (15:41 -0400)]
Notify ESP by email, too
Alex Dehnert [Thu, 26 Jul 2012 05:17:28 +0000 (01:17 -0400)]
Monitor SSL on olinda
Alex Dehnert [Thu, 26 Jul 2012 05:16:58 +0000 (01:16 -0400)]
Monitor lunatique pingability
Alex Dehnert [Sat, 31 Mar 2012 07:57:00 +0000 (03:57 -0400)]
Ignore brief spikes in disk due to backups(?)
Alex Dehnert [Fri, 9 Mar 2012 17:44:42 +0000 (12:44 -0500)]
Monitor linerva (especially disk)
Alex Dehnert [Sat, 1 Oct 2011 18:55:27 +0000 (14:55 -0400)]
Require more failed checks before alerting
Alex Dehnert [Sat, 1 Oct 2011 18:55:09 +0000 (14:55 -0400)]
Send zephyrs about esp to -c esp-auto, not -c esp
Alex Dehnert [Fri, 10 Jun 2011 01:30:34 +0000 (21:30 -0400)]
Increase the threshold for postfix queue alerts
Alex Dehnert [Mon, 25 Apr 2011 07:47:33 +0000 (03:47 -0400)]
Add Exim queue checks
Unfortunately, Exim doesn't allow non-admins to see the queues,
which makes it hard to actually use this check.
Alex Dehnert [Mon, 25 Apr 2011 07:25:42 +0000 (03:25 -0400)]
Check that we can connect to services on esp.mit.edu
Alex Dehnert [Sat, 23 Apr 2011 07:27:56 +0000 (03:27 -0400)]
Fix MySQL check
Alex Dehnert [Sat, 23 Apr 2011 07:08:07 +0000 (03:08 -0400)]
Expand monitoring of olinda
Add monitoring of:
* Postfix (SMTP connections, queue size)
* Dovecot (IMAPS connections)
* BIND (DNS port, I think)
* MySQL
Alex Dehnert [Wed, 20 Apr 2011 19:33:53 +0000 (15:33 -0400)]
Reconfigure esp so notifications go to -c esp
Alex Dehnert [Wed, 20 Apr 2011 19:22:38 +0000 (15:22 -0400)]
Add NRPE config
Alex Dehnert [Wed, 20 Apr 2011 19:21:53 +0000 (15:21 -0400)]
Monitor esp.mit.edu
Alex Dehnert [Wed, 20 Apr 2011 17:33:25 +0000 (13:33 -0400)]
Check every 2 minutes
Alex Dehnert [Wed, 20 Apr 2011 17:32:41 +0000 (13:32 -0400)]
Ignore the htpasswd file
Alex Dehnert [Wed, 20 Apr 2011 17:32:13 +0000 (13:32 -0400)]
Watch my machines --- novgorod, olinda
Alex Dehnert [Wed, 20 Apr 2011 17:31:47 +0000 (13:31 -0400)]
Watch scrubbing-bubbles
Alex Dehnert [Wed, 20 Apr 2011 17:31:18 +0000 (13:31 -0400)]
Add contacts (zephyr)
Alex Dehnert [Wed, 20 Apr 2011 17:30:19 +0000 (13:30 -0400)]
Add directory for site-specific config
This won't separate it all out, but it'll make me feel a bit better.
Alex Dehnert [Wed, 20 Apr 2011 17:29:25 +0000 (13:29 -0400)]
Fix line break in zephyr command
Alex Dehnert [Wed, 20 Apr 2011 15:46:43 +0000 (11:46 -0400)]
Add zephyr config
Alex Dehnert [Wed, 20 Apr 2011 15:44:46 +0000 (11:44 -0400)]
Add upstream nagios config