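• Alerts are created by writing Alert Rules in PromQL
  • The file that stores the definitions of the alerts Prometheus should raise is called a rule file
  • In this setup, rule files live in a rules directory next to prometheus.yml, and each file is named after the job it covers (a convention, not a requirement: any path listed under rule_files works)
  • As the number of alerts grows, more expressions are evaluated and the load on Prometheus increases → Recording Rules precompute frequently used expressions and store the results as new time series, which reduces the cost of querying them (and, when the rule aggregates, the cardinality of what is queried)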

Key fields in hardware_rule.yml → the file that configures the Prometheus alerts

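  1. record → marks a Recording Rule; the value is the name of the new time series, which later queries use to read the precomputed result
  2. alert → name of the alert to raise
  3. expr → the PromQL expression Prometheus evaluates at every evaluation interval; for a Recording Rule its result is stored as a new time series, for an alert it is the firing condition
  4. labels → labels attached to the alert when it is raised
  5. for → the alert fires only after the condition has held continuously for the specified duration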
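
The five fields above can be seen together in a minimal sketch. The file name, group name, rule names, and the `severity` label are hypothetical and chosen only for illustration; `up` is the built-in metric Prometheus records for every scrape target.

```yaml
# example_rule.yml (hypothetical file, for illustration only)
groups:
- name: example
  rules:
  # record: stores the result of expr as a new series named job:up:min
  - record: job:up:min
    expr: min by (job) (up)

  # alert: refers to the recorded series instead of repeating the query
  - alert: TargetDown
    expr: job:up:min < 1
    # for: the condition must hold for 5 minutes before the alert fires
    for: 5m
    labels:
      severity: 'warning'
    annotations:
      summary: "A target in job {{ $labels.job }} is down"
```

Rules within a group are evaluated in order, so the alert can rely on the series recorded just above it.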

1. YAML rule file for DELL server hardware alerts

$ mkdir -p /etc/prometheus/rules/

$ vi /etc/prometheus/rules/hardware_rule.yml
groups:
- name: hardware
  rules:
  - record: job:chassis
    expr: dell_hw_scrape_collector_success{collector="chassis"}

  - record: job:hardware_log
    expr: dell_hw_chassis_status{component="Hardware_Log"}

  - record: job:intrusion
    expr: dell_hw_chassis_status{component="Intrusion"}

  - record: job:chassis_batteries
    expr: dell_hw_scrape_collector_success{collector="chassis_batteries"}

  - record: job:batteries_status
    expr: dell_hw_chassis_status{component="Batteries"}

  - record: job:power_managemet
    expr: dell_hw_chassis_status{component="Power_Management"}

  - record: job:power_supplies
    expr: dell_hw_chassis_status{component="Power_Supplies"}

  - record: job:fans
    expr: dell_hw_scrape_collector_success{collector="fans"}

  - record: job:firmwares
    expr: dell_hw_scrape_collector_success{collector="firmwares"}

  - record: job:memory
    expr: dell_hw_scrape_collector_success{collector="memory"}

  - record: job:memory_status
    expr: dell_hw_chassis_memory_status{memory=~"DIMM_.*"}

  - record: job:nics
    expr: dell_hw_scrape_collector_success{collector="nics"}

  - record: job:processors
    expr: dell_hw_scrape_collector_success{collector="processors"}

  - record: job:cpu_status
    expr: dell_hw_chassis_processor_status{processor=~"CPU.*"}

  - record: job:ps
    expr: dell_hw_scrape_collector_success{collector="ps"}

  - record: job:ps_amps_sysboard_pwr
    expr: dell_hw_scrape_collector_success{collector="ps_amps_sysboard_pwr"}

  - record: job:storage_battery
    expr: dell_hw_scrape_collector_success{collector="storage_battery"}

  - record: job:storage_controller
    expr: dell_hw_scrape_collector_success{collector="storage_controller"}

  - record: job:storage_enclosure
    expr: dell_hw_scrape_collector_success{collector="storage_enclosure"}

  - record: job:storage_pdisk
    expr: dell_hw_scrape_collector_success{collector="storage_pdisk"}

  - record: job:storage_vdisk
    expr: dell_hw_scrape_collector_success{collector="storage_vdisk"}

  - record: job:system
    expr: dell_hw_scrape_collector_success{collector="system"}

  - record: job:temps
    expr: dell_hw_scrape_collector_success{collector="temps"}

  - record: job:temperatures
    expr: dell_hw_chassis_status{component="Temperatures"}

  - record: job:volts
    expr: dell_hw_scrape_collector_success{collector="volts"}

  - record: job:voltages
    expr: dell_hw_chassis_status{component="Voltages"}

  - alert: HardwareErrorandWarning
    expr: |
      ( job:chassis ) < 1 or
      ( job:hardware_log ) >= 1 or
      ( job:intrusion ) >= 1 or
      ( job:chassis_batteries ) < 1 or
      ( job:batteries_status ) >= 1 or
      ( job:power_managemet ) >= 1 or
      ( job:power_supplies ) >= 1 or
      ( job:fans ) < 1 or
      ( job:firmwares ) < 1 or
      ( job:memory ) < 1 or
      ( job:memory_status ) >= 1 or
      ( job:nics ) < 1 or
      ( job:processors ) < 1 or
      ( job:cpu_status ) >= 1 or
      ( job:ps ) < 1 or
      ( job:ps_amps_sysboard_pwr ) < 1 or
      ( job:storage_battery ) < 1 or
      ( job:storage_controller ) < 1 or
      ( job:storage_enclosure ) < 1 or
      ( job:storage_pdisk ) < 1 or
      ( job:storage_vdisk ) < 1 or
      ( job:system ) < 1 or
      ( job:temps ) < 1 or
      ( job:temperatures ) >= 1 or
      ( job:voltages ) >= 1 or
      ( job:volts ) < 1
    for: 30m
    labels:
      hardware_status: 'critical'
    annotations:
      summary: "{{ $labels.instance }}'s {{ $labels.collector }} Hardware Error"

2. YAML rule file for disk alerts

$ cat /etc/prometheus/rules/disk_rule.yml
groups:
- name: disk-alarm
  rules:
  - record: job:pdisk
    expr: dell_hw_storage_pdisk_status{disk=~".*"}
  - record: job:vdisk
    expr: dell_hw_storage_vdisk_status{vdisk=~".*"}

  - alert: DiskErrorandWarning
    expr: |
      ( job:pdisk ) >= 1 or
      ( job:vdisk ) >= 1
    for: 30m
    labels:
      physical_disk_status: 'critical'
    annotations:
      summary: "{{ $labels.instance }}'s Disk Error"
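
One way to check that a rule like this behaves as intended before loading it is a promtool unit test. The following is a sketch under assumptions: the test file name, the `instance` value, and the `disk` value are hypothetical, and the test file is assumed to sit in the same directory as disk_rule.yml.

```yaml
# disk_rule_test.yml (hypothetical file name)
rule_files:
  - disk_rule.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulated input: one physical disk reporting status 1 for an hour
    input_series:
      - series: 'dell_hw_storage_pdisk_status{instance="server1:9137", disk="0"}'
        values: '1+0x60'
    alert_rule_test:
      # 45m is past the 30m "for" window, so the alert should be firing
      - eval_time: 45m
        alertname: DiskErrorandWarning
        exp_alerts:
          - exp_labels:
              physical_disk_status: critical
              instance: server1:9137
              disk: "0"
            exp_annotations:
              summary: "server1:9137's Disk Error"
```

Such a test would be run with `promtool test rules disk_rule_test.yml`.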



Registering the alerts in Prometheus

  • Add the alert rule yml files to prometheus.yml, the Prometheus configuration file

    $ vi /etc/prometheus/prometheus.yml
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: prometheus
        replica: 0
    
    rule_files:
      - '/etc/prometheus/rules/hardware_rule.yml'
      - '/etc/prometheus/rules/disk_rule.yml'
    
    scrape_configs:
      - job_name: 'node-exporter'
        scrape_interval: 5s
        static_configs:
          - targets:
            - [node-exporter server IP]:9100
    
      - job_name: 'dellhw_exporter'
        scrape_interval: 60s
        static_configs:
          - targets:
            - [dellhw_exporter server IP]:9137

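  • Once the condition is met, the alert stays in the Pending state for the duration set in for (30m in the rules above) → if the fault persists, State changes to Firing

  • Example of an alert whose fault persisted: State shows Firing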
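
As the number of rule files grows, the rule_files section of prometheus.yml shown above does not have to list each file: its entries are glob patterns. A minimal sketch, assuming every rule file sits directly under /etc/prometheus/rules/ with a .yml extension:

```yaml
rule_files:
  # Glob pattern: loads every .yml file in the rules directory,
  # including hardware_rule.yml and disk_rule.yml
  - '/etc/prometheus/rules/*.yml'
```

Prometheus still has to reload its configuration before a newly added file is picked up.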
