ADH monitoring metrics

This article describes the monitoring metrics reported by ADH to a Monitoring cluster.

Metric types

All the metrics are divided into two groups: system metrics and service metrics. Both metric types are available in Graphite/Grafana UIs after the installation of the Monitoring cluster.

System metrics indicate general characteristics of cluster hosts, usually related to resource consumption. The reported system metrics are presented below.

System metrics

Metrics group    Description
cpu              CPU utilization
diskspace        Disk usage
files            File statistics
iostat           I/O performance
loadavg          System load averages
memory           Memory usage
netstat          Network connection statistics
network          Network interface performance
uptime           Uptime statistics

Service metrics reflect the characteristics of individual ADH services. They are listed below for each service.

HBase
Component Metrics group Metric name Description

RegionServer

IPC

numActiveHandler

The number of RPC handlers that serve incoming requests

numCallsInGeneralQueue

The number of user requests in the queue

numOpenConnections

The number of open connections (RPC)

JvmMetrics

GcCount

The total number of GC iterations

GcCountConcurrentMarkSweep

The number of GC iterations done with the Concurrent Mark Sweep algorithm

GcCountParNew

The number of GC iterations done with the ParNew algorithm

GcTimeMillis

The total GC time in milliseconds

GcTimeMillisConcurrentMarkSweep

The total GC time (CMS) in milliseconds

GcTimeMillisParNew

The total GC time (ParNew) in milliseconds

MemHeapCommittedM

The currently allocated heap memory in MB

MemHeapMaxM

The maximum heap memory in MB

MemHeapUsedM

The currently utilized heap memory in MB

Server

Append_95th_percentile

The 95th percentile latency for the append operation in the RegionServer

Append_median

The median latency for the append operation in the RegionServer

Delete_95th_percentile

The 95th percentile latency for the delete operation in the RegionServer

Delete_median

The median latency for the delete operation in the RegionServer

Get_95th_percentile

The 95th percentile latency for the get operation in the RegionServer

Get_median

The median latency for the get operation in the RegionServer

Increment_95th_percentile

The 95th percentile latency for the increment operation in the RegionServer

Increment_median

The median latency for the increment operation in the RegionServer

Put_95th_percentile

The 95th percentile latency for the put operation in the RegionServer

Put_median

The median latency for the put operation in the RegionServer

ScanTime_95th_percentile

The 95th percentile latency for the scan operation in the RegionServer

ScanTime_median

The median latency for the scan operation in the RegionServer

percentFilesLocal

The percent of store file data that can be read from the local DataNode
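
The GcCount and GcTimeMillis values above are running totals, so a single reading says little about current GC behavior. A minimal sketch of turning two readings into an average pause per collection is shown below. It assumes you already have two samples of the JvmMetrics group as plain dictionaries; how the samples are fetched, and the sample values themselves, are hypothetical.

```python
def average_gc_pause_ms(earlier, later):
    """Return the mean GC pause (ms) between two samples, or None if no GC ran."""
    gc_runs = later["GcCount"] - earlier["GcCount"]
    gc_time = later["GcTimeMillis"] - earlier["GcTimeMillis"]
    if gc_runs <= 0:
        # No collections in the interval, or the counters were reset by a JVM restart.
        return None
    return gc_time / gc_runs


# Hypothetical sample values, for illustration only.
gc_earlier = {"GcCount": 120, "GcTimeMillis": 3600}
gc_later = {"GcCount": 130, "GcTimeMillis": 4100}
print(average_gc_pause_ms(gc_earlier, gc_later))
```

The same difference-based approach applies to the per-algorithm pairs, such as GcCountParNew and GcTimeMillisParNew.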

HDFS
Component Metrics group Metric name Description

datanode

FSDatasetState

NumFailedVolumes

The number of failed volumes (disks) on the DataNode

JvmMetrics

GcCount

The total number of GC iterations

GcCountPSMarkSweep

The number of GC iterations done with the PS MarkSweep algorithm

GcCountPSScavenge

The number of GC iterations done with the PS Scavenge algorithm

GcTimeMillis

The total GC time in milliseconds

GcTimeMillisPSMarkSweep

The total GC time (PS MarkSweep) in milliseconds

GcTimeMillisPSScavenge

The total GC time (PS Scavenge) in milliseconds

MemHeapCommittedM

The currently allocated heap memory in MB

MemHeapMaxM

The maximum heap memory in MB

MemHeapUsedM

The currently utilized heap memory in MB

RpcActivityForPort

NumOpenConnections

The number of open RPC connections

RpcProcessingTimeAvgTime

The average request processing time

RpcQueueTimeAvgTime

The average time of pending requests in a queue

namenode

FSNamesystem

BlockCapacity

The current capacity of the NameNode block map (the number of blocks it can hold)

BlocksTotal

The total number of data blocks

CapacityTotal

The total space available for storing data

CapacityUsed

The total capacity used for storing data

CorruptReplicatedBlocks

The number of corrupted replicated blocks in HDFS

FilesTotal

The total number of files in HDFS

MissingBlocks

The number of missing data blocks

UnderReplicatedBlocks

The number of data blocks that do not have sufficient replicas in HDFS

JvmMetrics

GcCount

The total number of GC iterations

GcCountPSMarkSweep

The number of GC iterations done with the PS MarkSweep algorithm

GcCountPSScavenge

The number of GC iterations done with the PS Scavenge algorithm

GcTimeMillis

The total GC time in milliseconds

GcTimeMillisPSMarkSweep

The total GC time (PS MarkSweep) in milliseconds

GcTimeMillisPSScavenge

The total GC time (PS Scavenge) in milliseconds

MemHeapCommittedM

The currently allocated heap memory in MB

MemHeapMaxM

The maximum heap memory in MB

MemHeapUsedM

The currently utilized heap memory in MB

RpcActivityForPort

NumOpenConnections

The number of open RPC connections

RpcProcessingTimeAvgTime

The average request processing time

RpcQueueTimeAvgTime

The average time of pending requests in a queue
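
As an illustration of how the NameNode FSNamesystem values can be combined, the sketch below derives a capacity usage percentage from CapacityUsed and CapacityTotal and flags non-zero MissingBlocks and UnderReplicatedBlocks. The input dictionary, the 80% threshold, and the function name are assumptions made for this example, not part of ADH.

```python
def hdfs_health(fs, usage_warn_pct=80.0):
    """Summarize FSNamesystem values as a usage percentage and a list of warnings."""
    total = fs["CapacityTotal"]
    used_pct = 100.0 * fs["CapacityUsed"] / total if total > 0 else 0.0
    warnings = []
    if used_pct >= usage_warn_pct:
        warnings.append(f"capacity {used_pct:.1f}% used")
    for name in ("MissingBlocks", "UnderReplicatedBlocks"):
        if fs[name] > 0:
            warnings.append(f"{name}={fs[name]}")
    return used_pct, warnings


# Hypothetical values, for illustration only.
fs_sample = {
    "CapacityTotal": 1000,
    "CapacityUsed": 850,
    "MissingBlocks": 0,
    "UnderReplicatedBlocks": 12,
}
print(hdfs_health(fs_sample))
```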

Hive
Component Metrics group Metric name Description

hiveserver2

ClassLoading

LoadedClassCount

The number of loaded classes

GarbageCollector

PSMarkSweep

The GC time spent using the PS MarkSweep algorithm

PSScavenge

The GC time spent using the PS Scavenge algorithm

HS2

active_calls_api_Driver_execute

The active_calls_api_<method_name> metrics indicate the number of corresponding method invocations at the given time

active_calls_api_Driver_run

active_calls_api_PostHook_org_apache_hadoop_hive_ql_stats_OperatorStatsReaderHook

active_calls_api_compile

active_calls_api_hs2_operation_INITIALIZED

The number of active operations in HiveServer2 (HS2) with the corresponding status

active_calls_api_hs2_operation_PENDING

active_calls_api_hs2_operation_RUNNING

active_calls_api_hs2_sql_operation_PENDING

The number of active SQL operations with the corresponding status

active_calls_api_hs2_sql_operation_RUNNING

active_calls_api_parse

The number of requests submitted to the HiveServer via the API, which were successfully parsed and ready for processing

active_calls_api_releaseLocks

The active_calls_api_<method_name> metrics indicate the number of corresponding method invocations at the given time

active_calls_api_runTasks

active_calls_api_semanticAnalyze

active_calls_api_waitCompile

active_calls_hs2_compiling_queries

The number of requests that are currently compiling

active_calls_hs2_executing_queries

The number of requests that are currently executing

active_calls_hs2_submitted_queries

The number of requests submitted for execution

api_Driver_execute

The api_<method_name> metrics indicate the number of the corresponding method invocations

api_Driver_run

api_PostHook_org_apache_hadoop_hive_ql_stats_OperatorStatsReaderHook

api_compile

api_hs2_operation_INITIALIZED

api_hs2_operation_PENDING

api_hs2_operation_RUNNING

api_hs2_sql_operation_PENDING

api_hs2_sql_operation_RUNNING

api_parse

api_releaseLocks

api_runTasks

api_semanticAnalyze

api_waitCompile

cumulative_connection_count

The total number of established connections to HiveServer2 since the server startup

exec_async_pool_size

The current size of the HiveServer2 asynchronous thread pool

exec_async_queue_size

The current size of the HiveServer2 asynchronous operation queue

hs2_active_sessions

The number of active sessions on HiveServer2

hs2_compiling_queries

The number of queries being compiled on HiveServer2

hs2_completed_operation_CLOSED

The number of completed operations with the corresponding status

hs2_completed_operation_FINISHED

hs2_completed_sql_operation_CLOSED

The number of completed SQL operations with the corresponding status

hs2_completed_sql_operation_FINISHED

hs2_executing_queries

The number of queries being executed on HiveServer2

hs2_open_sessions

The number of open sessions on HiveServer2

hs2_sql_operation_active_user

The current number of active users performing SQL operations on HiveServer2

hs2_submitted_queries

The number of queries submitted to HiveServer2

hs2_succeeded_queries

The number of queries that succeeded on HiveServer2

buffers_direct_count

JVM metrics

buffers_direct_used

buffers_mapped_capacity

buffers_mapped_count

buffers_mapped_used

classLoading_loaded

classLoading_unloaded

gc_PS-MarkSweep_count

gc_PS-MarkSweep_time

gc_PS-Scavenge_count

gc_PS-Scavenge_time

jvm_pause_extraSleepTime

memory_heap_committed

memory_heap_init

memory_heap_max

memory_heap_usage

memory_heap_used

memory_non-heap_committed

memory_non-heap_init

memory_non-heap_max

memory_non-heap_usage

memory_non-heap_used

memory_pools_Code-Cache_usage

memory_pools_Compressed-Class-Space_usage

memory_pools_Metaspace_usage

memory_pools_PS-Eden-Space_usage

memory_pools_PS-Old-Gen_usage

memory_pools_PS-Survivor-Space_usage

memory_total_committed

memory_total_init

memory_total_max

memory_total_used

open_connections

open_operations

qc_current_size

qc_max_size

threads_blocked_count

threads_count

threads_daemon_count

threads_deadlock_count

threads_new_count

threads_runnable_count

threads_terminated_count

threads_timed_waiting_count

threads_waiting_count

waiting_compile_ops

Memory

HeapMemoryUsage_committed

The amount of memory allocated for heap

HeapMemoryUsage_init

The initial heap memory size

HeapMemoryUsage_max

The maximum heap memory size

HeapMemoryUsage_used

The utilized heap memory size

NonHeapMemoryUsage_committed

The amount of memory allocated for non-heap JVM areas

NonHeapMemoryUsage_init

The initial memory size allocated for non-heap JVM areas

NonHeapMemoryUsage_max

The maximum memory size allocated for non-heap JVM areas

NonHeapMemoryUsage_used

The utilized memory for non-heap JVM areas

MemoryPool

CodeCache

The code cache size

CompressedClassSpace

The size of the compressed class space

Metaspace

The metaspace size

PSEdenSpace

The size of the Eden space in HiveServer2 using the PS algorithm

PSOldGen

The size of the Old Generation space in HiveServer2 using the PS algorithm

PSSurvivorSpace

The size of the Survivor space in HiveServer2 using the PS algorithm

OperatingSystem

ProcessCpuLoad

The CPU load

Threading

DaemonThreadCount

The number of daemon threads

PeakThreadCount

The maximum number of threads

ThreadCount

The current number of threads

TotalStartedThreadCount

The total number of threads started since the JVM startup

metastore

ClassLoading

LoadedClassCount

The number of loaded classes

GarbageCollector

PSMarkSweep

The GC time spent using the PS MarkSweep algorithm

PSScavenge

The GC time spent using the PS Scavenge algorithm

Memory

HeapMemoryUsage_committed

The amount of memory allocated for heap

HeapMemoryUsage_init

The initial heap memory size

HeapMemoryUsage_max

The maximum heap memory size

HeapMemoryUsage_used

The utilized heap memory size

NonHeapMemoryUsage_committed

The amount of memory allocated for non-heap JVM areas

NonHeapMemoryUsage_init

The initial memory size allocated for non-heap JVM areas

NonHeapMemoryUsage_max

The maximum memory size allocated for non-heap JVM areas

NonHeapMemoryUsage_used

The utilized memory for non-heap JVM areas

MemoryPool

CodeCache

The code cache size

CompressedClassSpace

The size of the compressed class space

Metaspace

The metaspace size

PSEdenSpace

The size of the Eden space in the Hive Metastore using the PS algorithm

PSOldGen

The size of the Old Generation space in the Hive Metastore using the PS algorithm

PSSurvivorSpace

The size of the Survivor space in the Hive Metastore using the PS algorithm

OperatingSystem

ProcessCpuLoad

The CPU load

Threading

DaemonThreadCount

The number of daemon threads

PeakThreadCount

The maximum number of threads

ThreadCount

The current number of threads

TotalStartedThreadCount

The total number of threads started since the JVM startup

metastore

PS-MarkSweep_count

The number of GC iterations done using the PS MarkSweep algorithm

PS-MarkSweep_time

The total GC time elapsed using the PS MarkSweep algorithm

PS-Scavenge_count

The number of GC iterations done using the PS Scavenge algorithm

PS-Scavenge_time

The total GC time elapsed using the PS Scavenge algorithm

active_calls_create_table

The active_calls_<method_name> metrics indicate the number of corresponding method invocations at the given time

active_calls_drop_table

active_calls_get_all_functions

active_calls_get_config_value

active_calls_get_database

active_calls_get_databases

active_calls_get_functions

active_calls_get_multi_table

active_calls_get_table

active_calls_get_tables

active_calls_get_tables_by_type

api_create_table

The api_<method_name> metrics indicate the number of corresponding method invocations

api_create_table_with_environment_context

api_drop_table

api_drop_table_with_environment_context

api_flushCache

api_get_all_databases

api_get_all_functions

api_get_config_value

api_get_current_notificationEventId

api_get_database

api_get_databases

api_get_functions

api_get_multi_table

api_get_next_notification

api_get_table

api_get_table_objects_by_name_req

api_get_table_req

api_get_tables

api_get_tables_by_type

api_init

api_set_ugi

api_shutdown

blocked_count

The number of threads blocked

create_total_count_dbs

The number of databases created

create_total_count_partitions

The number of partitions created

create_total_count_tables

The number of tables created

daemon_count

The number of daemon threads

deadlock_count

The number of deadlocks detected

delete_total_count_dbs

The total number of deleted databases

delete_total_count_partitions

The total number of deleted partitions

delete_total_count_tables

The total number of deleted tables

direct_count

JVM metrics

direct_used

directsql_errors

heap_committed

heap_init

heap_max

heap_usage

heap_used

jvm_pause_extraSleepTime

jvm_pause_info-threshold

jvm_pause_warn-threshold

loaded

mapped_capacity

mapped_count

mapped_used

new_count

non-heap_committed

non-heap_init

non-heap_max

non-heap_usage

non-heap_used

open_connections

pools_Code-Cache_usage

pools_Compressed-Class-Space_usage

pools_Metaspace_usage

pools_PS-Eden-Space_usage

pools_PS-Old-Gen_usage

pools_PS-Survivor-Space_usage

runnable_count

terminated_count

timed_waiting_count

total_committed

total_count_dbs

total_count_partitions

total_count_tables

total_init

total_max

total_used

unloaded

waiting_count

Spark/Spark3
Component Metrics group Metric name Description

historyserver

ClassLoading

LoadedClassCount

The number of loaded classes

GarbageCollector

PSMarkSweep

The GC time spent using the PS MarkSweep algorithm

PSScavenge

The GC time spent using the PS Scavenge algorithm

Memory

HeapMemoryUsage_committed

The amount of memory allocated for heap

HeapMemoryUsage_init

The initial heap memory size

HeapMemoryUsage_max

The maximum heap memory size

HeapMemoryUsage_used

The utilized heap memory size

NonHeapMemoryUsage_committed

The amount of memory allocated for non-heap JVM areas

NonHeapMemoryUsage_init

The initial memory size allocated for non-heap JVM areas

NonHeapMemoryUsage_max

The maximum memory size allocated for non-heap JVM areas

NonHeapMemoryUsage_used

The utilized memory for non-heap JVM areas

MemoryPool

CodeCache

The code cache size

CompressedClassSpace

The size of the compressed class space

Metaspace

The metaspace size

PSEdenSpace

The size of the Eden space in the Spark History Server using the PS algorithm

PSOldGen

The size of the Old Generation space in the Spark History Server using the PS algorithm

PSSurvivorSpace

The size of the Survivor space in the Spark History Server using the PS algorithm

OperatingSystem

ProcessCpuLoad

The CPU load

Threading

DaemonThreadCount

The number of daemon threads

PeakThreadCount

The maximum number of threads

ThreadCount

The current number of threads

TotalStartedThreadCount

The total number of threads started since the JVM startup

YARN
Component Metrics group Metric name Description

historyserver

JvmMetrics

GcCount

The total number of GC iterations

GcCountPSMarkSweep

The number of GC iterations done with the PS MarkSweep algorithm

GcCountPSScavenge

The number of GC iterations done with the PS Scavenge algorithm

GcTimeMillis

The total GC time in milliseconds

GcTimeMillisPSMarkSweep

The total GC time (PS MarkSweep) in milliseconds

GcTimeMillisPSScavenge

The total GC time (PS Scavenge) in milliseconds

MemHeapCommittedM

The currently allocated heap memory in MB

MemHeapMaxM

The maximum heap memory in MB

MemHeapUsedM

The currently utilized heap memory in MB

nodemanager

JvmMetrics

GcCount

The total number of GC iterations

GcCountPSMarkSweep

The number of GC iterations done with the PS MarkSweep algorithm

GcCountPSScavenge

The number of GC iterations done with the PS Scavenge algorithm

GcTimeMillis

The total GC time in milliseconds

GcTimeMillisPSMarkSweep

The total GC time (PS MarkSweep) in milliseconds

GcTimeMillisPSScavenge

The total GC time (PS Scavenge) in milliseconds

MemHeapCommittedM

The currently allocated heap memory in MB

MemHeapMaxM

The maximum heap memory in MB

MemHeapUsedM

The currently utilized heap memory in MB

NodeManagerMetrics

AllocatedContainers

The number of allocated containers

AllocatedGB

The size of allocated memory in GB

AllocatedVCores

The number of allocated cores

AvailableGB

The size of available memory in GB

AvailableVCores

The number of available cores

BadLocalDirs

The number of directories on the local disk that cannot be used for storing task data due to errors

BadLogDirs

The number of directories on the local disk that cannot be used for storing task log files due to errors

ContainerLaunchDurationAvgTime

The average time spent on launching a task container

ContainersCompleted

The number of task containers that were completed successfully

ContainersFailed

The number of task containers that failed to complete

ContainersIniting

The number of task containers in the initialization state

ContainersKilled

The number of task containers that were forcibly stopped

ContainersLaunched

The number of task containers that were started successfully

ContainersRunning

The number of task containers that are currently running

GoodLocalDirsDiskUtilizationPerc

The percentage of disk space utilization in directories on a local disk that can be used for storing task data

GoodLogDirsDiskUtilizationPerc

The percentage of disk space utilization in directories on a local disk that can be used for storing task log files

resourcemanager

JvmMetrics

GcCount

The total number of GC iterations

GcCountPSMarkSweep

The number of GC iterations done with the PS MarkSweep algorithm

GcCountPSScavenge

The number of GC iterations done with the PS Scavenge algorithm

GcTimeMillis

The total GC time in milliseconds

GcTimeMillisPSMarkSweep

The total GC time (PS MarkSweep) in milliseconds

GcTimeMillisPSScavenge

The total GC time (PS Scavenge) in milliseconds

MemHeapCommittedM

The currently allocated heap memory in MB

MemHeapMaxM

The maximum heap memory in MB

MemHeapUsedM

The currently utilized heap memory in MB

QueueMetrics

AllocatedVCores

The number of allocated cores

AppsFailed

The number of applications that exited with an error

AppsKilled

The number of applications killed by a user

AppsPending

The number of applications waiting for resources

AppsRunning

The number of applications that are currently running

AppsSubmitted

The number of applications submitted to a queue

AvailableVCores

The number of available cores
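
The QueueMetrics values can be combined into a rough saturation signal. The sketch below assumes that AllocatedVCores plus AvailableVCores approximates the total number of cores visible to the queue, which this article does not state; the 0.75 threshold and the sample values are likewise hypothetical.

```python
def vcore_utilization(queue):
    """Return the share of cores in use, assuming allocated + available = total."""
    allocated = queue["AllocatedVCores"]
    capacity = allocated + queue["AvailableVCores"]
    return allocated / capacity if capacity > 0 else 0.0


# Hypothetical values, for illustration only.
queue_sample = {"AllocatedVCores": 30, "AvailableVCores": 10, "AppsPending": 4}
utilization = vcore_utilization(queue_sample)
# Treat the queue as saturated when most cores are busy and applications are waiting.
saturated = utilization >= 0.75 and queue_sample["AppsPending"] > 0
print(utilization, saturated)
```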

Metrics visualization

To visualize monitoring metrics as graphs and charts, use the Graphite and Grafana UIs available after the installation of a monitoring cluster.

Graphite

Graphite is a monitoring tool that stores numeric time-series data and visualizes it as graphs in a web UI.

To view metrics in Graphite, enter the address of the host where your monitoring cluster is installed into the browser address bar, for example, http://10.20.30.444:<port>. By default, Graphite runs on port 80, so you can omit the port number unless you specified a different port during the installation of your monitoring cluster.

Graphite metrics
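
Besides the web UI, Graphite serves the stored series over its HTTP render API, which can return JSON. The following sketch builds a render URL and extracts the latest non-null value from a response. The host name and the metric path are hypothetical, because the exact paths under which ADH metrics appear in Graphite are not listed in this article; browse the metric tree in the UI to find the real ones. The response is a canned string, so the example runs without a monitoring cluster.

```python
import json
from urllib.parse import urlencode


def build_render_url(host, target, window="-1h", port=80):
    """Build a Graphite render API URL that returns JSON for one target."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"http://{host}:{port}/render?{query}"


def latest_values(render_json):
    """Map each returned series name to its most recent non-null value."""
    result = {}
    for series in json.loads(render_json):
        # Each datapoint is a [value, timestamp] pair; the value is null when no data arrived.
        values = [v for v, _ts in series["datapoints"] if v is not None]
        result[series["target"]] = values[-1] if values else None
    return result


# Hypothetical host and metric path.
render_url = build_render_url("monitoring.example.com", "adh.host1.loadavg.01")
print(render_url)

# A canned response in the render API's JSON shape.
render_sample = (
    '[{"target": "adh.host1.loadavg.01", '
    '"datapoints": [[0.4, 1700000000], [null, 1700000060]]}]'
)
print(latest_values(render_sample))
```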

Grafana

Grafana allows you to query data and visualize metrics stored in Graphite.

To open the Grafana web UI, enter the address of the host where your monitoring cluster is installed into the browser address bar, for example, http://10.20.30.444:<port>. By default, the Grafana UI is available on port 3000.

On the Grafana home page, click Home, and then select the required dashboard to view the metrics.

Grafana metrics
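
Dashboards can also be listed programmatically through the Grafana HTTP API search endpoint. The sketch below only prepares the request and does not send it. The host name, the search term, and the token placeholder are hypothetical, and it assumes that your Grafana instance accepts an API token in a Bearer header.

```python
from urllib.parse import urlencode
from urllib.request import Request


def dashboard_search_request(host, api_token, query="", port=3000):
    """Prepare a GET request to the Grafana dashboard search endpoint."""
    params = urlencode({"type": "dash-db", "query": query})
    return Request(
        f"http://{host}:{port}/api/search?{params}",
        headers={"Authorization": f"Bearer {api_token}"},
    )


# Hypothetical host, token placeholder, and search term.
search_req = dashboard_search_request("monitoring.example.com", "<api-token>", query="HDFS")
print(search_req.full_url)
# Passing search_req to urllib.request.urlopen() would send it;
# the response body is a JSON list of matching dashboards.
```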