How long did you have to wait the last time the DHCP agent service was restarted? If you raised your hand and shouted “too long!”, this is the post for you.
OpenStack Neutron DHCP agent.
From the Red Hat OSP16 documentation, “the OpenStack Networking DHCP agent manages the network namespaces that are spawned for each project network to act as DHCP server. Each namespace runs a dnsmasq process that can allocate IP addresses to virtual machines on the network. If the agent is enabled and running when a subnet is created then by default that subnet has DHCP enabled.”
The Neutron DHCP agent creates a new network namespace per network. In this namespace, a “dnsmasq” instance is spawned. Each network has one or more subnets, each with its own CIDR; a new port created on this network receives an IP address from the pool formed by the combination of those CIDRs (for the sake of simplicity, we are not considering subnet pools or address scopes).
When a subnet is created, updated or deleted, the Neutron DHCP agent updates the “dnsmasq” configuration and forces the process to reload it by sending a SIGHUP signal.
High availability for DHCP.
This document describes how to configure the Neutron DHCP agent for high availability. In a nutshell, Neutron allows a network to be assigned to one or more Neutron DHCP agents. This number is controlled by the configuration variable dhcp_agents_per_network. The Neutron server has a DHCP scheduler that assigns any new network to a set of agents. This is how high availability is achieved for the DHCP service: if an agent is down, the others will answer the DHCP requests from a port.
The issue (too much, too loaded).
In environments with many networks, the Neutron DHCP agent re-sync process can take a long time. This re-sync is done when the agent is restarted (or when an issue is detected). Every time a Neutron DHCP agent is restarted, it needs to request all the network, subnet and port information from the Neutron server. That can take several minutes and, in some cases, more than one hour. In a customer environment with around 1500 networks and 15K ports, each agent took more than 100 minutes to become fully operational.
The time spent by a Neutron DHCP agent in the re-sync process can be measured by checking the timestamps of the following INFO messages (from the agent logs):
– “Synchronizing state”
– “Synchronizing state complete”
Between those two messages, the agent requests the information for all of its assigned networks and their related subnets and ports. Each network is processed individually and reported in the logs. The more networks and ports there are, the longer the re-sync takes.
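The measurement described above can be sketched in a few lines of Python. This is only a minimal sketch, not part of Neutron: the function name resync_durations is made up for this post, and it assumes that every log line starts with a 23-character timestamp such as 2020-05-11 10:00:00.000. If your deployment uses a different log format, adjust TS_FORMAT and TS_LENGTH accordingly.

```python
from datetime import datetime

# The two INFO messages that delimit a re-sync in the agent logs.
START_MSG = "Synchronizing state"
END_MSG = "Synchronizing state complete"

# Assumption: each line starts with "YYYY-MM-DD HH:MM:SS.mmm" (23 characters).
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS_LENGTH = 23


def resync_durations(lines):
    """Return the duration, in seconds, of every completed re-sync."""
    durations = []
    started_at = None
    for line in lines:
        # Check the end message first: the start message is a prefix of it.
        is_end = END_MSG in line
        is_start = not is_end and START_MSG in line
        if not (is_start or is_end):
            continue
        stamp = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        if is_start:
            started_at = stamp
        elif started_at is not None:
            durations.append((stamp - started_at).total_seconds())
            started_at = None
    return durations
```

The function accepts any iterable of lines, so an open log file can be passed to it directly. A re-sync that started but never logged its completion message is simply not reported.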
Reduce the redundancy.
The fewer agents serve each network, the fewer networks each agent has to re-sync. Unfortunately, this process is not automatic and requires the intervention of the system administrator. There are two steps. The first one is to reconfigure the number of agents per network, that is, to lower the value of dhcp_agents_per_network. Once it is changed, all the Neutron server processes must be restarted (I assume you have an HA environment with more than one Neutron server instance running). Once they are restarted, each new network will be scheduled to the new number of agents.
However, the existing networks are not rescheduled. This is the second step: removing the network / Neutron DHCP agent bindings. The system administrator needs to manually rebalance the number of networks assigned to the agents (mental note: implement a script to automate this process). To remove a network / Neutron DHCP agent binding, you need to execute:
$ neutron dhcp-agent-network-remove <DHCP_agent_id> <network_id_or_name>
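As a possible starting point for automating the rebalancing, the sketch below only plans which bindings to remove; it does not talk to Neutron at all. The function names plan_removals and as_commands and the agent and network IDs are hypothetical, and the sketch assumes you have already collected the current bindings as a mapping from network ID to the list of DHCP agent IDs hosting it. For each network with too many agents, it drops the binding of the agent that currently hosts the most networks.

```python
def plan_removals(bindings, agents_per_network):
    """Return the (agent_id, network_id) bindings to remove.

    bindings maps each network ID to the list of DHCP agent IDs hosting it.
    """
    if agents_per_network < 1:
        raise ValueError("agents_per_network must be at least 1")

    # Count how many networks each agent currently hosts.
    load = {}
    for agents in bindings.values():
        for agent in agents:
            load[agent] = load.get(agent, 0) + 1

    removals = []
    for network in sorted(bindings):
        agents = list(bindings[network])
        while len(agents) > agents_per_network:
            # Unbind the most loaded agent; ties are broken by agent ID
            # so that the resulting plan is deterministic.
            busiest = max(agents, key=lambda a: (load[a], a))
            agents.remove(busiest)
            load[busiest] -= 1
            removals.append((busiest, network))
    return removals


def as_commands(removals):
    """Render the plan as the CLI commands shown above."""
    return [
        "neutron dhcp-agent-network-remove %s %s" % (agent, network)
        for agent, network in removals
    ]
```

For example, with two networks that are both hosted by agent-a and agent-b and a target of one agent per network, the plan unbinds agent-b from the first network and agent-a from the second, leaving one network on each agent. Review the generated commands before running any of them.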
I hope that helped you.