Jump to content
Main menu
Main menu
move to sidebar
hide
Navigation
Main page
Recent changes
Help about MediaWiki
FUTO
Search
Search
Appearance
Create account
Log in
Personal tools
Create account
Log in
Pages for logged out editors
learn more
Contributions
Talk
Editing
Introduction to a Self Managed Life: a 13 hour & 28 minute presentation by FUTO software
(section)
Main Page
Discussion
English
Read
Edit
Edit source
View history
Tools
Tools
move to sidebar
hide
Actions
Read
Edit
Edit source
View history
General
What links here
Related changes
Special pages
Page information
Appearance
move to sidebar
hide
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
= Have your server email you when a hard drive is dying. = There is one caveat that makes ZFS & RAID functionally ''useless'' for many of its users.. 99% of the population don’t know their drive is failing until things start crashing and working horribly slow. By then, it’s usually too late. You’re heading to Rossmann Repair for data recovery. Then they think, ''“if I use RAID, I’m good! One drive can fail and it’ll still work!!!”'' No. You could have RAID 1 with 20 discs and it still wouldn’t matter, because ''NOBODY WHO HAS A LIFE CHECKS THE HEALTH OF THEIR DISK DRIVE EVERY DAY.'' If you only check your drive health when it fails, then RAID 1 with 5 disks is useless. You’re still only going to check it when the fifth one starts failing. <span id="step-1-setting-up-postfix-email-system-on-ubuntu-server-24.04"></span> == Step 1: Setting Up Postfix Email System on Ubuntu Server 24.04 == <span id="install-required-packages"></span> ==== 1.1 Install Required Packages ==== <pre>sudo apt update sudo apt install postfix libsasl2-modules mailutils -y</pre> When prompted during install: * Choose '''“Internet Site”''' for configuration type * Enter your system’s fully qualified domain name when asked of where we are sending emails from, in our case it is <code>home.arpa</code> * Recipient for root & postmaster mail will be the email you want to receive that at, for me I set it as the same email as ZFS alerts which is <code>l.a.rossmann@gmail.com</code> for me * Set '''“Force synchronous updates on mail queue?”''' to no <gallery mode="packed-hover" heights=250 widths=400 perrow=2> File:lu67917r1ezu_tmp_ff734222.png File:lu67917r1ezu_tmp_667e9c06.png File:lu67917r1ezu_tmp_f9f6cd56.png File:lu67917r1ezu_tmp_5c8e2e53.png File:lu67917r1ezu_tmp_b07ae624.png </gallery> <span id="configure-main-postfix-configuration---this-is-similar-to-what-we-did-for-freepbx-voicemail-alerts-in-the-previous-section"></span> ==== 1.2 Configure Main Postfix Configuration - this is similar to what we did for FreePBX voicemail alerts in the previous section ==== <ol style="list-style-type: decimal;"> <li><p>Backup existing configuration:</p> <pre>sudo cp /etc/postfix/main.cf /etc/postfix/main.cf.backup</pre></li> <li><p>Create new <code>main.cf</code>:</p> <pre>sudo nano /etc/postfix/main.cf</pre></li> <li><p>Copy and paste the provided configuration template if you need, and edit the <code>yourdriveisdead@stevesavers.com</code> email in the configuration file with the email you wish to have Postfix use to send you an email.</p></li></ol> <pre> # See /usr/share/postfix/main.cf.dist for a commented, more complete version # Debian specific: Specifying a file name will cause the first # line of that file to be used as the name. The Debian default # is /etc/mailname. #myorigin = /etc/mailname smtpd_banner = $myhostname ESMTP $mail_name (Debian/GNU) biff = no # appending .domain is the MUA's job. append_dot_mydomain = no # Uncomment the next line to generate "delayed mail" warnings #delay_warning_time = 4h readme_directory = no # See http://www.postfix.org/COMPATIBILITY_README.html -- default to 3.6 on # fresh installs. compatibility_level = 3.6 # TLS parameters smtpd_tls_cert_file=/etc/ssl/certs/ssl-cert-snakeoil.pem smtpd_tls_key_file=/etc/ssl/private/ssl-cert-snakeoil.key smtpd_tls_security_level=may smtp_tls_CApath=/etc/ssl/certs smtp_tls_security_level=may smtp_tls_session_cache_database = btree:${data_directory}/smtp_scache smtpd_relay_restrictions = permit_mynetworks permit_sasl_authenticated defer_unauth_destination myhostname = debian.home.arpa alias_maps = hash:/etc/aliases alias_database = hash:/etc/aliases mydestination = $myhostname, debian, localhost.localdomain, localhost relayhost = [smtp.postmarkapp.com]:587 smtp_use_tls = yes smtp_sasl_auth_enable = yes smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd smtp_sasl_security_options = noanonymous smtp_sasl_mechanism_filter = plain sender_canonical_maps = static:yourdriveisdead@stevesavers.com mynetworks = 127.0.0.0/8 [::ffff:127.0.0.0]/104 [::1]/128 mailbox_size_limit = 0 recipient_delimiter = + # WARNING: Changing the inet_interfaces to an IP other than 127.0.0.1 may expose Postfix to external network connections. # Only modify this setting if you understand the implications and have specific network requirements. inet_interfaces = 127.0.0.1 inet_protocols = all message_size_limit = 102400000</pre> <span id="set-up-smtp-authentication-and-use-your-usernamespasswordsemails-to-replace-mine"></span> ==== 1.3 Set Up SMTP Authentication, and use your usernames/passwords/emails to replace mine ==== <ol style="list-style-type: decimal;"> <li><p>Create the SASL password file:</p> <pre>sudo nano /etc/postfix/sasl_passwd</pre></li> <li><p>Add this line to the file, replacing the username & password with your credentials from postmark:</p></li></ol> <pre>[smtp.postmarkapp.com]:587 1788dd83-9917-46e1-b90a-3b9a89c10bd7:1788dd83-9917-46e1-b90a-3b9a89c10bd7</pre> <ol start="3" style="list-style-type: decimal;"> <li><p>Set proper permissions for security:</p> <pre>sudo chmod 600 /etc/postfix/sasl_passwd</pre></li> <li><p>Create the hash database file:</p> <pre>sudo postmap /etc/postfix/sasl_passwd</pre></li></ol> <span id="restart-and-test"></span> ==== 1.4 Restart and Test ==== <ol style="list-style-type: decimal;"> <li><p>Restart Postfix:</p> <pre>sudo systemctl restart postfix</pre></li> <li><p>Verify Postfix is running:</p> <pre>sudo systemctl status postfix</pre></li> <li><p>Test the email setup:</p> <pre>echo "Test email from $(hostname)" | mail -s "Test Email" l.a.rossmann@gmail.com</pre></li></ol> <gallery mode="packed-hover" heights=250 widths=400 perrow=2> File:lu67917r1ezu_tmp_db123f98.png File:lu67917r1ezu_tmp_cf91d8ae.png </gallery> '''Verification Steps:''' <ol style="list-style-type: decimal;"> <li><p>Check mail logs for errors:</p> <pre>sudo tail -f /var/log/mail.log</pre></li> <li><p>Verify permissions:</p> <pre>ls -l /etc/postfix/sasl_passwd*</pre> <p>Should show:</p> <ul> <li><code>-rw------- 1 root root</code> for sasl_passwd</li> <li><code>-rw------- 1 root root</code> for sasl_passwd.db</li></ul> </li></ol> <span id="troubleshooting-1"></span> === Troubleshooting: === If emails aren’t being sent: <ol style="list-style-type: decimal;"> <li><p>Check Postfix status:</p> <pre>sudo systemctl status postfix</pre></li> <li><p>View detailed mail logs:</p> <pre>sudo journalctl -u postfix</pre></li></ol> Check mail logs for errors: <pre>sudo tail -f /var/log/mail.log</pre> # Check <code>/var/log/mail.log</code> for errors # Check that Postmark credentials are correct (e.g., if you typed <code>postmark.com</code> instead of <code>postmarkapp.com</code> for server, etc.) # Verify sender domain (<code>stevesavers.com</code>) is properly configured in Postmark # Check the '''Activity''' tab on the transactional stream in Postmark # Mail log will tell you what you fkd up 99% of the time. '''This setup does as follows:''' * Send FROM: ''[mailto:yourdriveisdead@stevesavers.com yourdriveisdead@stevesavers.com]'' * Send TO: ''[mailto:l.a.rossmann@gmail.com l.a.rossmann@gmail.com]'' * Use the configured SMTP relay * Include proper authentication The system is now ready for the next step in the ZFS monitoring setup. <span id="step-2-creating-complete-zfs-monitoring-script-with-logging"></span> == Step 2: Creating Complete ZFS Monitoring Script with Logging == <span id="create-log-directory"></span> ==== 2.1 Create Log Directory ==== <pre>sudo mkdir -p /var/log/zfs-monitor sudo chown root:root /var/log/zfs-monitor sudo chmod 755 /var/log/zfs-monitor</pre> <span id="make-the-monitoring-script"></span> ==== 2.2 Make the Monitoring Script ==== <pre>sudo -u root nano /root/zfs_health_check.sh</pre> Copy and paste this complete script: <pre>#!/bin/bash # Configuration EMAIL="l.a.rossmann@gmail.com" HOSTNAME=$(hostname) LOG_FILE="/var/log/zfs-monitor/health_check.log" LOG_MAX_SIZE=$((10 * 1024 * 1024)) # 10MB in bytes # Email configuration FROM_EMAIL="yourdriveisdead@stevesavers.com" FROM_NAME="Steve" REPLY_TO="Steve <steve@stevesavers.com>" # Use a more consistent Reply-To address RETURN_PATH="bounce@stevesavers.com" # A safe Return-Path address to handle bounces properly # Create required directories mkdir -p "$(dirname "$LOG_FILE")" # Initialize error log errors="" # Logging functions rotate_log() { if [ -f "$LOG_FILE" ] && [ $(stat -f%z "$LOG_FILE" 2>/dev/null || stat -c%s "$LOG_FILE") -gt "$LOG_MAX_SIZE" ]; then mv "$LOG_FILE" "$LOG_FILE.old" fi } log_message() { echo -e "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE" } log_error() { local message="$1" errors="${errors}\n$message" log_message "ERROR: $message" } # Check overall pool status check_pool_status() { while IFS= read -r pool; do status=$(zpool status "$pool") # Check for common failure keywords if echo "$status" | grep -E "DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|SUSPENDED" > /dev/null; then log_error "ALERT: Pool $pool is not healthy:\n$status" fi # Check for errors if echo "$status" | grep -v "No known data errors" | grep -i "errors:" > /dev/null; then log_error "ALERT: Pool $pool has errors:\n$status" fi # Check scrub status if echo "$status" | grep "scan" | grep -E "scrub canceled|scrub failed" > /dev/null; then log_error "ALERT: Pool $pool has unusual scrub status:\n$(echo "$status" | grep "scan")" fi done < <(zpool list -H -o name) } # Check individual device status check_devices() { while IFS= read -r pool; do devices=$(zpool status "$pool" | awk '/ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED/ {print $1,$2}') echo "$devices" | while read -r device state; do if [ "$state" != "ONLINE" ] && [ "$device" != "pool" ] && [ "$device" != "mirror" ] && [ "$device" != "raidz1" ] && [ "$device" != "raidz2" ]; then log_error "ALERT: Device $device in pool $pool is $state" fi done done < <(zpool list -H -o name) } # Check capacity threshold (80% by default) check_capacity() { while IFS= read -r pool; do capacity=$(zpool list -H -p -o capacity "$pool") if [ "$capacity" -ge 80 ]; then log_error "WARNING: Pool $pool is ${capacity}% full" fi done < <(zpool list -H -o name) } # Check dataset properties check_dataset_properties() { while IFS= read -r dataset; do # Skip base pools if ! echo "$dataset" | grep "/" > /dev/null; then continue fi # Check if compression is enabled compression=$(zfs get -H compression "$dataset" | awk '{print $3}') if [ "$compression" = "off" ]; then log_error "WARNING: Compression is disabled on dataset $dataset" fi # Check if dataset is mounted mounted=$(zfs get -H mounted "$dataset" | awk '{print $3}') if [ "$mounted" = "no" ]; then log_error "WARNING: Dataset $dataset is not mounted" fi # Check available space available=$(zfs get -H available "$dataset" | awk '{print $3}') if [ "$available" = "0" ] || [ "$available" = "0B" ]; then log_error "CRITICAL: Dataset $dataset has no available space" fi done < <(zfs list -H -o name) } # Function to send email send_email() { local subject="$1" local content="$2" { echo "Subject: $subject" echo "To: ${EMAIL}" echo "From: ${FROM_NAME} <${FROM_EMAIL}>" echo "Reply-To: ${REPLY_TO}" echo "Return-Path: ${RETURN_PATH}" echo "Content-Type: text/plain; charset=UTF-8" echo echo "$content" } | sendmail -t } # Main execution rotate_log log_message "Starting ZFS health check" # Run all checks check_pool_status check_devices check_capacity check_dataset_properties # Send notification if there are errors if [ -n "$errors" ]; then log_message "Issues detected - sending email alert" subject="Storage Alert: Issues Detected on ${HOSTNAME}" # Simplified subject line content=$(echo -e "ZFS Health Monitor Report from ${HOSTNAME}\n\nThe following issues were detected:${errors}") send_email "$subject" "$content" else log_message "All ZFS checks passed successfully" fi</pre> <span id="set-proper-permissions-2"></span> ==== 2.3 Set Proper Permissions ==== <pre>sudo -u root chmod +x /root/zfs_health_check.sh</pre> <span id="test-the-script"></span> ==== 2.4 Test the Script ==== <pre>sudo /root/zfs_health_check.sh</pre> <span id="make-sure-logging-works"></span> ==== 2.5 Make sure logging works ==== <pre>tail -f /var/log/zfs-monitor/health_check.log</pre> <span id="features-of-this-script"></span> ==== 2.6 Features of this Script: ==== * '''Monitoring''': ** It tells you when your pool has issues BEFORE all your drives die ** Device status checks ** Capacity warnings * '''Email Alerts''': ** Sends when issues are detected ** Includes error information The script is now ready for cron job configuration and regular use. Cron jobs are tasks we tell the machine to perform at regular intervals, similar to setting a utility bill to autopay. <span id="step-3-create-cron-job"></span> == Step 3: Create Cron Job == <ol style="list-style-type: decimal;"> <li><p>Open root’s crontab:</p> <pre>sudo crontab -e</pre></li> <li><p>Add these lines:</p> <pre># ZFS Health Check - Run every 15 minutes */15 * * * * /root/zfs_health_check.sh >/dev/null 2>&1 # Log rotation - Run daily at midnight 0 0 * * * find /var/log/zfs-monitor -name "*.old" -mtime +7 -delete</pre></li></ol> <span id="step-4-verify-it-works-again-just-because"></span> == Step 4: Verify it works again, just because == Run the script manually to ensure it works: <pre>sudo /root/zfs_health_check.sh</pre> <span id="check-logs"></span> === Check Logs === Monitor the log file for any issues: <pre>tail -f /var/log/zfs-monitor/health_check.log</pre> <span id="make-sure-cron-job-is-listed"></span> === Make sure Cron Job is listed === Verify that the cron job is correctly listed: <pre>sudo crontab -l</pre> <span id="test-email-notifications"></span> === Test Email Notifications === # Unplug a drive. # Wait. # Does an email come through? The monitoring system is now fully configured and will: * Check ZFS status every 15 minutes * Log all checks to <code>/var/log/zfs-monitor/health_check.log</code> * Automatically rotate logs when they reach 10MB * Send email alerts only when issues are detected * Clean up old log files after 7 days <span id="how-to-tell-if-you-won"></span> == How to tell if you won: == * ✓ Test email received * ✓ Script detects simulated issues * ✓ Cron job executes on schedule * ✓ Logs show proper entries * ✓ Alerts generated for pool degradation * ✓ System returns to normal after tests If you got an email, congrats, it works! <span id="step-5-set-up-os-raid-array-to-email-you-when-theres-a-problem-as-well"></span> == Step 5: Set up OS RAID Array to email you when there’s a problem as well == What we set up above is for your '''''ARCHIVE''''' storage. What about your operating system? We will do the same thing, and also go over a barbaric backup routine that works for me. <span id="creating-the-alert-script"></span> ==== 5.1 Creating the alert script ==== I’m not a programmer, so bear with me. This script is for my personal use, but I’m sharing it because it works. Here’s what you need to do: # '''Edit Email Addresses''': You’ll need to change the email addresses in the script. This includes: #* The recipient email #* The sender email #* The reply-to address #* The return path for bounced emails # '''Script Location''': Save the script at <code>root/mdadm_alert.sh</code> <pre>sudo -u root nano -w /root/mdadm_alert.sh</pre> Enter the following: <pre>#!/bin/bash # thank you to stack overflow for giving me the courage to wade through 100s of posts and hack together something that looks like it works. # stricter error handling set -euo pipefail # ‘set -e’ exits on errors, ‘u’ throws errors on unset variables, & ‘pipefail’ exits if any part of a pipeline fails IFS=$'\n\t' # Set IFS (Internal Field Separator) to newline & tab to avoid issues with spaces and other weird characters in filenames # Configuration variables (where settings are stored) EMAIL="l.a.rossmann@gmail.com" # Email to send alerts to - EDIT THIS HOSTNAME=$(hostname) # Pull the system's hostname dynamically and save it here LOG_DIR="/var/log/mdadm-monitor" # Directory path for where logs go LOG_FILE="${LOG_DIR}/raid_health_check.log" # Full path to the specific log file for RAID checks LOG_MAX_SIZE=$((10 * 1024 * 1024)) # Maximum log file size in bytes (10 MB here) # Email configuration for the alert message FROM_EMAIL="yourdriveisdead@stevesavers.com" # The email address that will appear as the sender - EDIT THIS FROM_NAME="Steve" # name of the sender, EDIT THIS REPLY_TO="Steve <steve@stevesavers.com>" # Reply-to email address, EDIT THIS RETURN_PATH="bounce@stevesavers.com" # Return path for bounced emails when email fails EDIT THIS # make empty variables & associated arrays errors="" # Empty variable to collect error messages drive_health_report="" # Another empty variable to store drive health details declare -A RAID_ARRAYS # array to keep track of RAID arrays we find, indexed by name like "boot" declare -A SMART_SCORES # array to store SMART scores for drives, indexed by rive path # Set up log directory and ensure permissions are correct setup_logging() { # Make the log directory if it doesn’t already exist mkdir -p "$LOG_DIR" || { echo "ERROR: Cannot create log directory $LOG_DIR"; exit 1; } # Exit with error if I can’t make the directory chmod 750 "$LOG_DIR" # Set directory permissions to allow owner & group access but not others # Check if the log file exists and exceeds the max size limit if [ -f "$LOG_FILE" ] && [ "$(stat -c%s "$LOG_FILE")" -gt "$LOG_MAX_SIZE" ]; then # ‘stat -c%s’ gives the size in bytes mv "$LOG_FILE" "$LOG_FILE.old" # Archive the old log file by renaming it fi touch "$LOG_FILE" # Create an empty log file if it doesn’t exist chmod 640 "$LOG_FILE" # Set permissions on the log file (read/write for owner, read for group) } # Function for logging messages w/ timestamps log_message() { local timestamp # Make local variable for this timestamp=$(date '+%Y-%m-%d %H:%M:%S') # Generate a timestamp in this specific format echo "[$timestamp] $1" | tee -a "$LOG_FILE" # Output the message with the timestamp to both console & log file } # Function for logging errors (adds them to the error string and logs them as "ERROR") log_error() { local message="$1" # Message passed to this function errors="${errors}\n$message" # Append this message to the errors variable log_message "ERROR: $message" # Log the error with a timestamp } # Check that required (commands) are installed on the system check_dependencies() { log_message "Checking required dependencies..." # Announce the check in the log local missing_deps=() # Initialize an empty array for any missing commands # Loop through each command we need, checking if it’s available for dep in mdadm smartctl lsblk findmnt awk grep dmsetup; do if ! command -v "$dep" &>/dev/null; then # If the command is missing, add it to the array missing_deps+=("$dep") fi done # If the array of missing dependencies isn’t empty, log an error and exit if [ ${#missing_deps[@]} -ne 0 ]; then log_error "Missing required dependencies: ${missing_deps[*]}" # Log missing commands log_error "Install them with: sudo apt-get install mdadm smartmontools util-linux findutils gawk grep dmsetup" exit 1 # Exit with error because we’re missing something we need(find what you need if you're getting this) fi } # Find & detect RAID arrays on this system detect_raid_arrays() { log_message "Detecting RAID arrays..." # Log that we’re looking for RAID arrays # Find all block devices with names like /dev/md0, /dev/md1 (these are RAID arrays like the one you made for the OS & boot) local md_devices md_devices=$(find /dev -name 'md[0-9]*' -type b) # Save this list to the md_devices variable # Loop through each RAID array found and log its details for md_dev in $md_devices; do local array_detail # Temporary variable for array details array_detail=$(mdadm --detail "$md_dev" 2>/dev/null) || continue # Get RAID details; skip if it fails # Extract the RAID array name from the details local array_name array_name=$(echo "$array_detail" | grep "Name" | awk '{print $NF}') # Last word on the "Name" line is the array name # Use the name to decide if this array is for boot or root, then add it to RAID_ARRAYS if [[ "$array_name" == *"bootraid"* ]]; then # Array name contains "bootraid" RAID_ARRAYS["boot"]="$md_dev" # Save the device path with the key "boot" log_message "Found boot array: $md_dev ($array_name)" # Log the found boot array elif [[ "$array_name" == *"osdriveraid"* ]]; then # Array name contains "osdriveraid" RAID_ARRAYS["root"]="$md_dev" # Save the device path with the key "root" log_message "Found root array: $md_dev ($array_name)" # Log the found root array fi done # Check if we actually found both root and boot arrays, and log an error if any are missing if [ -z "${RAID_ARRAYS["boot"]:-}" ] || [ -z "${RAID_ARRAYS["root"]:-}" ]; then # If either key is empty log_error "Failed to detect both boot and root RAID arrays" # Log a general error [ -z "${RAID_ARRAYS["boot"]:-}" ] && log_error "Boot array not found" # Specific message if boot is missing [ -z "${RAID_ARRAYS["root"]:-}" ] && log_error "Root array not found" # Specific message if root is missing return 1 # Return an error code fi # Print out a summary of all arrays found log_message "Detected arrays:" for purpose in "${!RAID_ARRAYS[@]}"; do log_message " $purpose: ${RAID_ARRAYS[$purpose]}" done } # Check the health of a specific RAID array check_array_status() { local array="$1" # The path of the array device local purpose="$2" # Either "boot" or "root" to clarify which array this is # Verify that the array actually exists as a block device if [ ! -b "$array" ]; then log_error "$purpose array device $array does not exist" # Log the missing device return 1 # Return error because we can’t check a nonexistent device fi # Get details about the RAID array and store it in the detail variable local detail detail=$(mdadm --detail "$array" 2>&1) || { # ‘2>&1’ captures error output in case of issues log_error "Failed to get details for $purpose array ($array)" return 1 # Exit with an error code if it failed } # Extract the state of the array (like "clean" or "active") and log it local state state=$(echo "$detail" | grep "State :" | awk '{print $3,$4}') # Get the words after "State :" from the details log_message "$purpose array status: $state" # If the array is in an undesirable state, log a warning if [[ "$state" =~ degraded|DEGRADED|failed|FAILED|inactive|INACTIVE ]]; then log_error "$purpose array ($array) is in concerning state: $state" fi # Detect failed devices within the array local failed_devices failed_devices=$(echo "$detail" | grep "Failed Devices" | awk '{print $4}') # Pull the failed devices count if [ "$failed_devices" -gt 0 ]; then # If there are failed devices, go through each one while read -r line; do if [[ "$line" =~ "faulty" ]]; then # If the line mentions "faulty" local failed_dev failed_dev=$(echo "$line" | awk '{print $7}') # Get the 7th word (the device name) log_error "$purpose array ($array) has failed device: $failed_dev" # Log which device failed fi done < <(echo "$detail" | grep -A20 "Number" | grep "faulty") # Look up to 20 lines after "Number" to find "faulty" fi # Check if any devices are rebuilding, and log it if they are if echo "$detail" | grep -q "rebuilding"; then while read -r line; do if [[ "$line" =~ "rebuilding" ]]; then # Check for "rebuilding" in the line local rebuilding_dev rebuilding_dev=$(echo "$line" | awk '{print $7}') # Get the device name being rebuilt log_error "$purpose array ($array) is rebuilding device: $rebuilding_dev" # Log the rebuilding device fi done < <(echo "$detail" | grep -A20 "Number" | grep "rebuilding") # Again, look ahead 20 lines for any "rebuilding" mention fi } # Function to check the health of each drive within a RAID array check_drive_health() { local drive="$1" # The drive device to check (e.g., /dev/sda) local health_score=100 # Initialize health score to 100 (a perfect score) local issues="" # Skip the check if it’s not a valid block device if [ ! -b "$drive" ]; then log_error "Device $drive is not a block device" # Log the invalid device return 1 # Exit with an error code fi log_message "Checking health of drive $drive..." # Announce which drive we’re checking # Run SMART health check and reduce health score if it fails if ! smartctl -H "$drive" | grep -q "PASSED"; then # If it does NOT say "PASSED" health_score=$((health_score - 50)) # Drop score by 50 points if it fails issues+="\n- Overall health check failed" # Log this specific issue fi # Collect SMART attributes for further checks local smart_attrs smart_attrs=$(smartctl -A "$drive" 2>/dev/null) || true # Redirect error to /dev/null # Check for reallocated sectors (sign of drive wear and tear) local reallocated reallocated=$(echo "$smart_attrs" | awk '/^ 5/ {print $10}') # Look for attribute ID 5 in SMART data if [ -n "$reallocated" ] && [ "$reallocated" -gt 0 ]; then health_score=$((health_score - 10)) # Drop health score by 10 if we have reallocated sectors issues+="\n- Reallocated sectors: $reallocated" # Add to issues list fi # Check for pending sectors (could cause read/write errors) local pending pending=$(echo "$smart_attrs" | awk '/^197/ {print $10}') # Look for attribute ID 197 in SMART data if [ -n "$pending" ] && [ "$pending" -gt 0 ]; then health_score=$((health_score - 10)) # Drop health score by 10 if pending sectors are present issues+="\n- Pending sectors: $pending" # Add to issues list fi SMART_SCORES["$drive"]=$health_score # Save the final score in SMART_SCORES array if [ "$health_score" -lt 100 ]; then drive_health_report+="\nDrive: $drive\nHealth Score: $health_score/100\nIssues:$issues" # Append issues to report if any were found fi } # Send email if any errors or health issues were found send_email() { local subject="RAID Alert: Issues Detected on ${HOSTNAME}" # Set email subject line local content="RAID Health Monitor Report from ${HOSTNAME}\nTime: $(date '+%Y-%m-%d %H:%M:%S')\n" [ -n "$errors" ] && content+="\nRAID Issues:${errors}" # Append RAID issues to the email content if any [ -n "$drive_health_report" ] && content+="\nDrive Health Report:${drive_health_report}" # Append drive health report if any issues were found # Build the email using sendmail syntax { echo "Subject: $subject" echo "To: ${EMAIL}" echo "From: ${FROM_NAME} <${FROM_EMAIL}>" echo "Reply-To: ${REPLY_TO}" echo "Return-Path: ${RETURN_PATH}" echo "Content-Type: text/plain; charset=UTF-8" # Text format for readability echo echo -e "$content" # Use ‘-e’ to allow newline characters } | sendmail -t # Pipe the entire email message to sendmail for delivery } # Main function to execute checks and send email if needed main() { # Make sure script is run as root for necessary permissions [ "$(id -u)" -ne 0 ] && { echo "ERROR: This script must be run as root"; exit 1; } setup_logging # Call function to initialize logging setup log_message "Starting RAID health check" # Announce the start of the health check check_dependencies # Verify dependencies are available detect_raid_arrays # Detect RAID arrays # Loop through each RAID array and check its status, then check each drive in the array for purpose in "${!RAID_ARRAYS[@]}"; do array="${RAID_ARRAYS[$purpose]}" check_array_status "$array" "$purpose" # For each device in the RAID array, check health while read -r device; do if [[ "$device" =~ ^/dev/ ]]; then check_drive_health "$device" fi done < <(mdadm --detail "$array" | grep "active sync" | awk '{print $NF}') done # Send an email if errors or health issues were found; otherwise, log a success message [ -n "$errors" ] || [ -n "$drive_health_report" ] && send_email || log_message "All checks passed successfully" } # Execute the main function to start everything main # Calls the main function, running all the checks</pre> Set permissions properly so it can run: <pre>sudo -u root chmod +x /root/mdadm_alert.sh</pre> <span id="setting-up-the-cron-job"></span> ==== 5.2 Setting Up the Cron Job ==== We want this script to run regularly. I am going to set it to run every 15 minutes. <pre># Open the crontab editor sudo -u root crontab -e</pre> Add the following line to run the script every minute (for testing purposes): <pre>* * * * * /root/mdadm_alert.sh</pre> <blockquote>'''Note:''' For regular use, set it to run every fifteen minutes, with a line such as <code>*/15 * * * * /root/mdadm_alert.sh</code> </blockquote> <span id="testing-the-setup---software-run-first."></span> ==== 5.3 Testing the setup - software run first. ==== Let’s simulate a fault condition on <code>/dev/md126</code> which is what I set up as the RAID1 array for the operating system installation; this is where we created the logical volume for <code>/</code> # Check the status of it as it is now: <pre>sudo mdadm --detail /dev/md126</pre> <ol start="2" style="list-style-type: decimal;"> <li>If it shows up as healthy, run the script to make sure we do not have false positives.</li></ol> <pre>sudo -u root /root/mdadm_alert.sh</pre> <ol start="3" style="list-style-type: decimal;"> <li>If no false positives, simulate fault condition:</li></ol> <pre>sudo mdadm /dev/md126 --fail /dev/sdb3</pre> <code>/dev/sdb3</code> was the drive & partition that was used in my RAID array. Yours may differ, refer to the output of <code>mdadm --detail</code> to see how your RAID array is comprised, and then fail one of the two devices. <ol start="4" style="list-style-type: decimal;"> <li>Run the monitoring script to test again.</li></ol> <pre>sudo -u root /root/mdadm_alert.sh</pre> You should receive an email. Check spam. <ol start="5" style="list-style-type: decimal;"> <li>Undo what you did, un-fail the drive.</li></ol> <pre>sudo mdadm /dev/md126 --remove /dev/sdb3 sudo mdadm /dev/md126 --add /dev/sdb3</pre> <ol start="6" style="list-style-type: decimal;"> <li>Watch it re-sync. Don’t mess with anything until it is fully resynced.</li></ol> <pre>watch cat /proc/mdstat</pre> <span id="testing-the-setup-for-real---hardware-fault."></span> ==== 5.4 Testing the setup for real - hardware fault. ==== Now, let’s test this setup. Unplug one of the drives and see if you get a failure alert. Obviously, don’t do this after you start storing anything important on here. We do this in the build phase of our system to make sure it works, BEFORE trusting this system with anything important. # Check the status of it as it is now: <pre>sudo mdadm --detail /dev/md126</pre> <ol start="2" style="list-style-type: decimal;"> <li>If it shows up as healthy, run the script to make sure we do not have false positives.</li></ol> <pre>sudo -u root /root/mdadm_alert.sh</pre> <ol start="3" style="list-style-type: decimal;"> <li>If no false positives, unplug the drive from the running system.</li></ol> <code>/dev/sdb3</code> was the drive & partition that was used in my RAID array. Yours may differ, refer to the output of <code>mdadm --detail</code> to see how your RAID array is comprised, and then fail one of the two devices. <ol start="4" style="list-style-type: decimal;"> <li>Run the monitoring script to test again.</li></ol> <pre>sudo -u root /root/mdadm_alert.sh</pre> You should receive an email. Check spam. <ol start="5" style="list-style-type: decimal;"> <li>Undo what you did, un-fail the drive after plugging it back in..</li></ol> <pre>sudo mdadm /dev/md126 --remove /dev/sdb3 sudo mdadm /dev/md126 --add /dev/sdb3</pre> <ol start="6" style="list-style-type: decimal;"> <li>Watch it re-sync. Don’t mess with anything until it is fully resynced.</li></ol> <pre>watch cat /proc/mdstat</pre> <span id="step-6-backup-strategy"></span> == Step 6: Backup Strategy == Now, let’s talk about backups. It’s not enough to just have a RAID setup; you need a backup plan for when carelessness strikes. <span id="backup-method"></span> ==== 6.1 Backup Method ==== Here’s my approach: * '''Physical Copy''': I make a physical copy of my disk. This might seem old-school, but it works for me. Another approach: * '''LVM Snapshots''': You can take an LVM snapshot and then use <code>rsync</code> to back up your data. This method can be hit or miss. I don’t use this. You can take a snapshot of your drive with LVM, rsync your files off of the drive elsewhere, reinstall the operating system, and rsync them back, but… what if some of your files are for older libraries, or programs/configuration files that have different syntax with different versions? It can become a rabbit hole to hell very easily, and I’m not going to begin to torture newbies with this. '''DDRescue''' is the tool I use to make a copy of my drive. I connect the drive via a USB 3 to SATA plug and create a backup. It’s best to do this to the same make/model of drive if possible. <span id="ddrescue-guide-from-ubuntu-server-live-environment"></span> ==== 6.2 DDRescue Guide from Ubuntu Server Live Environment ==== We’re going to boot from the same Ubuntu Server LiveUSB image you created to install Ubuntu Server onto the happycloud host machine. * Boot from the USB Drive <gallery mode="packed-hover" heights=250 widths=400 perrow=2> File:lu55028jxc7f_tmp_911d702.png File:lu55028jxc7f_tmp_a33d9a7f.png </gallery> # Insert the USB drive into your server. # Power on the server and enter the boot menu (usually by pressing '''F12''' or another function key). # Select the '''UEFI option''' for your USB drive. # Choose to Try Ubuntu Server & do not install it. * Install ddrescue # Update package list & install ddrescue: <pre>sudo apt update sudo add-apt-repository universe sudo apt install gddrescue</pre> <ol start="2" style="list-style-type: decimal;"> <li>Check Current Drives (BEFORE Plugging in Source)</li></ol> <pre>sudo fdisk -l</pre> Take note of the present drives. <ol start="3" style="list-style-type: decimal;"> <li><p>Connect Source Drive (operating system solid state drive from the happycloud host machine). Either will do. Either connect it physically to an existing SATA/NVME port, or use a USB-SATA or USB-NVME enclosure if this makes it easier for you.</p></li> <li><p>Wait 5-10 seconds. Be patient.</p></li> <li><p>Check which drive it is. It will be the new drive that shows up. Make sure the model as well as the size & partitions matches what you are expecting.</p> <pre>sudo fdisk -l</pre></li> <li><p>Connect Target Drive (blank identical disk you are making into a backup drive)</p></li> <li><p>Wait 5-10 seconds. Be patient.</p></li> <li><p>Check which drive it is. It will be the new drive that shows up. Make sure the model as well as the size & partitions matches what you are expecting.</p> <pre>sudo fdisk -l</pre></li></ol> '''TRIPLE CHECK YOUR DEVICES''' <pre># List all drives again sudo fdisk -l</pre> <ol start="9" style="list-style-type: decimal;"> <li>Run DDRescue</li></ol> <pre>sudo ddrescue -f -d -r3 /dev/source /dev/target logfile.log</pre> For instance, if the source is <code>/dev/sdc</code> & target is <code>/dev/sdd</code>: <pre>sudo ddrescue -f -d -r3 /dev/sdc /dev/sdd logfile.log</pre> Option meanings: - <code>-f</code> : Force overwrite target - <code>-d</code> : Use direct disk access - <code>-r3</code>: Number of retry attempts on bad sectors - logfile.log: Saves progress (can resume if interrupted) '''⚠️ WARNING: ⚠️''' <ol style="list-style-type: decimal;"> <li><p>TRIPLE CHECK device names</p> <ul> <li>Wrong device = destroyed data</li> <li>Source and target reversed = destroyed source</li></ul> </li> <li><p>Target MUST be same size or larger than source</p></li> <li><p>Make sure you’re using whole drives:</p> <ul> <li><code>/dev/sdc</code> (correct, whole drive)</li> <li><code>/dev/sdc1</code> (WRONG, just one partition)</li></ul> </li> <li><p>If unsure which is which, unplug/replug and watch:</p> <pre>sudo dmesg | tail</pre> <p>It will show new devices added to the linux machine</p></li></ol> <blockquote>'''IMPORTANT NOTE:''' Always have a physical copy of a known-working server solid state drive. If something wrong, you can quickly restore your system by plugging in the backup drive and be back up in 90 seconds or less. </blockquote> <span id="raid-configuration-recommendations"></span> == RAID Configuration Recommendations == <ul> <li><p>For those who are extra cautious, consider running a RAID 1 setup with '''three''' drives instead of '''two'''. Here’s why:</p> <ul> <li>'''Redundancy''': When one drive fails, the others are likely not far behind. Having a third drive adds some padding.</li> <li>'''Peace of mind''': If you’re paranoid about data loss, this setup is a safer bet.</li></ul> <p>If you wanted to avoid stressing the SSD, you could create a ZFS dataset on the ZFS pool of hard drives you set up for virtual machines, mount that as <code>/var/lib/libvirt/images/</code>, but I’ve gotten spoiled by the speed of SSDs - I don’t want to go back. I realize that writing to them a lot means killing them, and I’m ok with that. :)</p></li></ul> <span id="os-drive-backup-conclusion"></span> == OS drive backup conclusion: == Once everything is set up the way you like, shut down your system, remove one of the drives, and make a backup. Use a drive of equal or greater size for the backup. This way, if disaster strikes, you can restore your system in no time. We now have a simple & effective way to know when our operating system drive is about to die on us, so we can take action before anything horrible occurs. Best of all, if you set this up properly, you can have zero downtime & not even have to turn off the machine to get back up and running when a drive fails. <span id="setting-up-immich-google-photosicloud-replacement"></span>
Summary:
Please note that all contributions to FUTO may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see
FUTO:Copyrights
for details).
Do not submit copyrighted work without permission!
To protect the wiki against automated edit spam, we kindly ask you to solve the following hCaptcha:
Cancel
Editing help
(opens in new window)