Problems with ntpd shutting down

1/19/2023 Several days into the testing on this box, I am finding that the ntpd time server daemon is disconnecting from the GPS and then 7 hours later, it shuts down. I am still isolating the problem. The bare GPS NMEA and 1pps interface is done via the gpsd daemon. The ntpd daemon interfaces with the gpsd daemon through a shared memory block.

I have used ntpd in the distant past, and it seems reliable enough. It is pretty complex and the configuration for it is definitely not trivial. In prior applications, I had it talking directly with the GPS modules.

The Trimble Copernicus II GPS modules have been working well for me in several other applications for years. This is a new board from an old design, so it's possible that the GPS is having problems, but that seems less likely.

The gpsd daemon is a new one for me and it does not seem to be doing anything that I really need in this application. My first step is setting up a test platform that has ntpd talking directly with the GPS and removing gpsd.

I just got the hardware and software together to test this idea. Now it is time to wait and see how it goes.

If the problem is not solved, I will swap the GPS module with a new design. The Trimble Copernicus II modules are not made anymore and I could not find any around from a credible source at a reasonable price. ST has a new line of low cost GPS modules called Teseo LIV3 and I put together a small breakout board for them before the holidays.

It is possible that the problem is in the configuration of ntpd, and I am going to leave that one for last because it is going to be the most difficult one to pursue.

1/26/2023 I am not ready to declare victory on this one, but I have gotten 6 days of operation from ntpd without a shutdown. Previously, I was getting a few hours up to 2 1/2 days before ntpd disconnected from the gps and pps sources. I swapped the Trimble GPS module from my test platform into the box and made the configuration changes necessary to disable gpsd and change ntpd over to use the Trimble TSIP protocol instead of NMEA.

I added another bicolor LED and plugged it into an unused socket on the misc LED controller board to use as an ntpd status indicator. I wrote the code to exec the ntpq tool and parse its output to determine whether ntpd is still connected to the gps and pps sources, and added a jitter threshold set to 1.0 ms. If the ntpd daemon is shut down, the status LED shows red. If ntpd has disconnected from the gps and pps sources, the LED shows yellow, and if the jitter values are > 1.0 ms, the LED shows green. If everything is good, the LED is off. This code runs periodically (every 2 minutes right now) and updates the LED as described. This status indicator was added so that I don't have to power up another machine to check the ntpd health; I can just look at the LED.
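As an illustration of the status logic described above, here is a minimal Python sketch. It is not the monitor's actual code. It assumes the text comes from the `ntpq -p` peers billboard (ten whitespace-separated columns after the tally character, with `reach` printed in octal and `jitter` last), and it assumes a source counts as "disconnected" when its reach register is 0. The function name `led_state` and the color strings are hypothetical.

```python
def led_state(ntpq_output, jitter_limit_ms=1.0):
    """Map `ntpq -p` billboard text to an LED color name (hypothetical helper).

    "red"    : no usable reply from ntpd (empty output or no billboard)
    "yellow" : no sources listed, or a listed source has reach == 0
    "green"  : all sources reachable, but some jitter exceeds the limit
    "off"    : everything looks healthy
    """
    if not ntpq_output:
        return "red"

    rows = []
    past_header = False
    for line in ntpq_output.splitlines():
        if line.startswith("="):        # separator under the column titles
            past_header = True
            continue
        if not past_header or not line.strip():
            continue
        fields = line[1:].split()       # column 0 holds the tally code
        if len(fields) < 10:
            continue
        try:
            reach = int(fields[6], 8)   # reach register is printed in octal
            jitter = float(fields[9])   # jitter in milliseconds
        except ValueError:
            continue                    # skip rows that do not parse
        rows.append((reach, jitter))

    if not past_header:
        return "red"                    # e.g. an error message instead of a table
    if not rows or any(reach == 0 for reach, _ in rows):
        return "yellow"
    if any(jitter > jitter_limit_ms for _, jitter in rows):
        return "green"
    return "off"
```

Keeping the classification separate from the code that runs ntpq makes it easy to feed it captured output when checking the thresholds.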

At some point, I will re-configure the GPS module that I pulled out and put it on the test platform to test it. This is just a sanity check; I would like to know for sure that the problem was not the hardware. That sounds like a thing for another day though.

The system is running on the box with the display now, for some longer term testing.

1/29/2023 Yesterday, at about 02:00, I noticed that the ntpd status LED was showing that it could not communicate with ntpd. Testing manually, ntpd was fine and there was nothing in the logs to indicate that it had a problem. The way the monitor is coded, the LED shows red if it cannot communicate with the daemon, so if the monitor itself had a problem, it would also show red. Thinking about the problem (instead of sleeping), it occurred to me that the monitor had been running about 36 hours when it stopped working. With the monitor running once every 2 minutes, that is around 1000 iterations, which is suspiciously close to the maximum number of file descriptors allowed per process.
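The arithmetic behind that suspicion can be checked with a short Python sketch. The 1024 figure is an assumption (it is a common default soft limit on Linux, but the actual limit on this box is not stated), as are the three descriptors assumed to be open already for stdin, stdout, and stderr. The function names are hypothetical.

```python
import resource


def runs_in(hours, interval_minutes):
    """Number of monitor runs that fit into the given number of hours."""
    return int(hours * 60 / interval_minutes)


def hours_until_fd_exhaustion(fd_limit, interval_minutes,
                              leaked_per_run=1, already_open=3):
    """Hours until the descriptor table is full if each run leaks descriptors.

    already_open=3 assumes only stdin, stdout, and stderr are open at start.
    """
    runs = (fd_limit - already_open) // leaked_per_run
    return runs * interval_minutes / 60.0


# Soft limit of the process running this sketch (may differ from the box).
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft descriptor limit here:", soft_limit)
print("runs in 36 hours at 2-minute intervals:", runs_in(36, 2))
print("hours to exhaust 1024 descriptors:", hours_until_fd_exhaustion(1024, 2))
```

With one leaked descriptor per run and a 1024 limit, the table fills after roughly 34 hours, which is in line with the observed failure at about 36 hours.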

The way that I coded the monitor, upon requesting status, a new process is forked off, and a pipe is allocated to ship the data from the child process back to the parent process. The stdout file descriptor is replaced with the write end of the pipe in the child process using a dup2() call. The child process execs the ntpq command to get the status from ntpd, and the parent process (my monitor code) reads the ntpq stdout data from the read end of the pipe. When the read is complete, the pipe is closed, which should return the file descriptor to the OS. After the dup2() call, it is necessary to close both the read and the remaining write descriptors for the pipe in the child process, and the write end of the pipe in the parent process, or the descriptors will not get returned to the OS when the child process exits.
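The fork, pipe, dup2, and exec sequence with all of the closes in place can be sketched in Python using the `os` module, which wraps the same system calls. This is an illustrative sketch under that assumption, not the monitor's actual code. The function name `run_and_capture` is hypothetical, and the command is passed in as a parameter so the sketch does not depend on ntpq being installed.

```python
import os


def run_and_capture(argv):
    """Run argv in a child process; return (exit_code, stdout_bytes)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()

    if pid == 0:
        # Child: point stdout at the pipe's write end, then close both
        # original pipe descriptors so only fd 1 still refers to the pipe.
        try:
            os.dup2(write_fd, 1)
            os.close(read_fd)
            if write_fd != 1:
                os.close(write_fd)
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)   # reached only if the exec failed

    # Parent: close the write end right away. If it stays open, the read
    # below never sees end-of-file and the descriptor is never released.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:    # closes read_fd on exit
        data = reader.read()

    _, status = os.waitpid(pid, 0)              # reap the child
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return code, data
```

A call such as `run_and_capture(["ntpq", "-p"])` would return the exit code and the billboard text as bytes. Because every descriptor is closed on both sides, repeated calls reuse the same descriptor numbers.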

It turned out that I had forgotten to close the write end of the pipe after the dup2() call in the child process, and a new file descriptor was getting allocated each time it was called. This was visible because the file descriptor value was incrementing each time the monitor was called. After adding the close on the write file descriptor in the child process, the incrementing stopped and the file descriptor has the same value each run as it should.
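The symptom of incrementing descriptor values is easy to reproduce in isolation. The Python sketch below is a stand-alone demonstration, not the monitor's code: it simply omits a close in the calling process to simulate a leak, and records the write-end descriptor number that `os.pipe()` hands back each time. The function name `observe_pipe_fds` is hypothetical.

```python
import os


def observe_pipe_fds(iterations, close_write_end):
    """Create a pipe per iteration and return the write-end fd numbers seen.

    With close_write_end=False the write end is deliberately left open to
    simulate the forgotten close; leaked descriptors are released at the end.
    """
    seen = []
    leaked = []
    for _ in range(iterations):
        read_fd, write_fd = os.pipe()
        seen.append(write_fd)
        os.close(read_fd)
        if close_write_end:
            os.close(write_fd)
        else:
            leaked.append(write_fd)     # simulated leak
    for fd in leaked:                   # clean up after the demonstration
        os.close(fd)
    return seen


print("leaky:", observe_pipe_fds(5, close_write_end=False))
print("clean:", observe_pipe_fds(5, close_write_end=True))
```

In the leaky run the numbers climb on every iteration, while in the clean run the same number comes back each time, which is the same check used above to confirm the fix.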

The ntpd daemon ran through all the debug and testing with no issues.

1/30 After 48 hours of operation, the monitor is still running fine, as is ntpd. Good news on both fronts.

2/8 Ntpd ran for over a week with no issues, so I think that this issue is solved. The ntpd monitor that I added to the LED time display system is working fine.

Discussions

Ken Yap wrote 01/19/2023 at 23:55

I turned a RPi 1 into a GNSS time server in #An NTP server using GNSS for time. Never had any problem with ntpd (chronyd in the final version) stopping. One thing I did was use a light Debian distro called DietPi instead of the usual Raspberry Pi OS, to minimise the OS RAM footprint.

