This subsection describes the key elements of hardware monitoring and monitoring the status of node uptime and performance status. It does not consider the monitoring of system logs.
Due to the decentralized nature of the blockchain infrastructure, it is challenging to find an appropriate method for monitoring the status of all running nodes. Validating whether a node is running or not is insufficient to determine the overall health of a specific node. A node as a virtual or physical appliance could be running, while the actual blockchain-specific process is not operational. Once adequate measures have been taken to properly monitor a node's performance, then an overall network health can be determined.
The following parameters are prerequisites for effective hardware monitoring and they are blockchain agnostic. However, there might be a qualitative difference for required monitoring, based on the type of node (e.g. validator vs. read-only nodes).
It is recommended to get an overview about:
- Selection of components to be monitored
- Selection of the measured variables
What should be monitored
The status of the following node properties could be monitored:
- Operating system
- System voltage & Uninterruptible power supply (UPS)
- Liveness and readiness of node process
- Connectivity of public ports
- CPU available/used
- Memory available/used
- Disk available/used
- Bandwidth available/used
- Link Speed
How should it be monitored?
The node properties could be monitored a follows:
- Use of a watchdog (hardware or software)
- Sensor technology
- Storage technology for metrics
- Transmission protocols and architecture (push vs pull)
- Display of the measured values and visualisation
- Selection of the analysis method
- what to alert: The "what" can be inferred from standard operational practices and might be blockchain agnostic, e.g. selection of relevant metrics and criteria
- how to alert
- who to alert
- Definition of sampling rates (per Metric)
- Selection of the transmission protocols
- Monitoring the infrastructure layer introduces overhead caused by reporting processes
- Monitoring the blockchain layer require resources. The process involves parsing the blockchain and creating a regular database with the parsed information. This may use 10 times more disk space as the blockchain itself uses. Disk, CPU and Memory requirements for monitoring the blockchain layer grows together with the blockchain disk footprint
Unresolved questions are:
- Defining a process to ingest and visualize the data reported by the core nodes
- Defining a process for reporting hardware resource available and usage
- How to report the information needed for monitoring without granting more permissions that are required to perform the tasks
- Establishing an alerting strategy when multiple node owners are involved
Internal references and dependencies
'How to alert' and 'who to alert' refers to the section Collaboration.
References to best practices, examples
Core nodes are different for each blockchain infrastructure architecture.
For public ethereum nodes, the bootnodes are the core nodes, as described in the documentation.
The way the Ethereum team monitors their bootnodes has been addressed by Péter Szilágyi at Devcon 5 Monitoring an Ethereum infrastructure. The video describes monitoring at infrastructure layer. Péter mentions they used DataDog, Graphana, Phrometheus and InfluxDB.
The PoW mining nodes can also be monitored individually by each node owner.
For the blockchain layer, this service category is mentioned in EBSI V1, however it may contain further monitoring tools. As of V1, the existing monitoring tools are block explorers.
Finding an appropriate way of monitoring the infrastructure layer is challenging and is highly dependent on the infrastructure architecture. Security considerations have to be taken into account when the core nodes are owned and controlled by various parties.
The blockchain layer monitoring can be easily adapted from a working solution if the underlying blockchain technology is the same (ie: Ethereum based blockchain)
Authors: Iosif Peterfi, David Maas, Chinmay Khandekar, Kevin Wittek, Andrei Ionita
Status: work in progress
Last modified: 2021-03-17