TL;DR: We fixed Scheduled Builds. After splitting the scheduled build worker into a stand-alone micro-service and enabling debug logging in production, we found an underlying issue in the open source library we use to trigger scheduled jobs when running in a distributed environment. By reducing the number of worker instances to one, we were able to work around the bug, and for roughly the past 40 hours all scheduled builds have been running on time.
Hi All,
Many of you reported Scheduled Builds starting later than they were supposed to. The following is a technical explanation of what happened and what we did to investigate the issue. We are also working on a way to make it clearer when we know about live production issues and are investigating; more on that in a later post.
The Investigation
In an effort to provide transparency across Unity Cloud Build, we wanted to share how we investigated and resolved the problem of scheduled builds not starting on time. To preface: we are still investigating the performance of auto-builds that poll source systems, which include Git SSH, Perforce, SVN, and Plastic. This post, however, focuses purely on the efforts related to scheduled builds.
Over the past several weeks we saw an increase in the number of forum posts and support tickets about scheduled builds (e.g. daily at 10:00 AM) not running as scheduled. Our initial attempts to reproduce the issue in our non-production environments were fruitless, and we were unable to track down the root cause: the issue would appear in production and then suddenly disappear. At the time, the logic for polling builds and scheduled builds lived in the same micro-service. After a span of about 4 hours, the CPU utilization of that service would spike to 100% and rarely drop below 95%; memory usage was also high. We doubled the number of pods running the service and saw scheduled builds trigger more reliably. However, after hours of running without issue, the CPU usage would spike again, and once it reached 100%, scheduled builds were no longer running on time.
This led us to split the micro-service in two: a service for polling builds and a separate service for scheduled builds. After splitting the services earlier this week, scheduled builds once again became more consistent. CPU and memory utilization for the new scheduled worker remained low; however, after several hours of running, scheduled builds once again stopped triggering. This was extremely perplexing, as there were no resource constraints, and the configuration of the scheduler itself shouldn't have prevented locks from being placed on scheduled jobs so that they could run.
We enabled debug logging in production and, after several hours, began to see log messages about the lock limit filling up. We reduced the number of pods running the worker and increased the lock limit to a number well above the number of scheduled builds in our system. By all rights, this should have resolved the issue. Unfortunately, it did not: after a few hours, the lock limit messages popped back up in the logs. We had attempted most of the solutions proposed for similar problems in the open source library's GitHub issues, and none of them were working for us.
The library's documentation suggests that, when running multiple workers, you set a limit on the number of jobs a single worker can lock, to prevent one worker from locking all of the jobs. If a single worker locked all of the jobs, it could delay jobs from starting as scheduled. After our investigation, we believe there is an underlying issue with how the open source library handles locks in a multi-worker setup: the lock limit is reported as reached even when the number of locked jobs is below the limit.
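To make the mechanism described above concrete, here is a toy model of per-worker lock limits in a multi-worker scheduler. This is a sketch of the general technique only, not the actual library's implementation; the class, field names, and dispatch loop are all illustrative assumptions:

```python
class Worker:
    """Toy model of a scheduler worker that locks due jobs before running them.
    Illustrative only -- not the real library's implementation."""

    def __init__(self, name, lock_limit):
        self.name = name
        self.lock_limit = lock_limit  # max jobs this worker may hold locked at once
        self.locked = []

    def try_lock(self, job, now):
        # Refuse to lock beyond the per-worker lock limit.
        if len(self.locked) >= self.lock_limit:
            return False
        # Only lock jobs that are due and not already locked by another worker.
        if job["locked_by"] is None and job["next_run"] <= now:
            job["locked_by"] = self.name
            self.locked.append(job)
            return True
        return False


def dispatch(jobs, workers, now):
    """Each worker greedily locks due jobs, up to its lock limit."""
    for worker in workers:
        for job in jobs:
            worker.try_lock(job, now)


# Five due jobs, two workers each limited to locking three jobs: no single
# worker can claim everything, so the work is spread across workers.
jobs = [{"id": i, "next_run": 0, "locked_by": None} for i in range(5)]
workers = [Worker("w1", lock_limit=3), Worker("w2", lock_limit=3)]
dispatch(jobs, workers, now=1)
```

In this toy version, the first worker stops at three locks and the second worker picks up the remaining two. The bug we believe we hit is that the real library's accounting can report the limit as full even when a worker holds fewer locks than its limit, which leaves due jobs unclaimed.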
The Resolution
We reduced the number of scheduled workers to one, increased the CPU and memory for that single worker, and removed the lock limit entirely. After running overnight (in the NA timezones), every scheduled job started within roughly a minute of its scheduled start time. The logs no longer mention the lock limit being reached, and our resource utilization has remained well below 30% of what is allocated to the system.
Next Steps
We are disappointed that we didn't discover this issue on our own and instead had to rely on y'all posting on the forums or submitting tickets about scheduled builds not starting on time. In an effort to get in front of further issues related to scheduled builds, we will be adding alerting based on the metadata in the scheduled builds database table. If a job responsible for triggering a build is more than a couple of minutes late, we want to generate an alert.
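As a sketch of what such a lateness check might look like, the following scans rows from a hypothetical scheduled-builds table for jobs that are more than a couple of minutes overdue. The field names and threshold are illustrative assumptions, not our actual schema:

```python
from datetime import datetime, timedelta

# Illustrative threshold: how late a job may be before we alert.
LATE_THRESHOLD = timedelta(minutes=2)


def find_late_jobs(rows, now):
    """Return ids of jobs whose scheduled run time passed more than
    LATE_THRESHOLD ago without the build having been triggered.
    Field names are hypothetical, not the real table schema."""
    late = []
    for row in rows:
        triggered = row["last_triggered_at"]
        # The job is still pending if it never ran, or last ran before
        # this scheduled slot.
        pending = triggered is None or triggered < row["next_run_at"]
        if pending and now - row["next_run_at"] > LATE_THRESHOLD:
            late.append(row["job_id"])
    return late


now = datetime(2016, 1, 1, 10, 5)
rows = [
    # 5 minutes overdue and never triggered -> should alert.
    {"job_id": "daily-ios", "next_run_at": datetime(2016, 1, 1, 10, 0),
     "last_triggered_at": None},
    # Triggered on time -> no alert.
    {"job_id": "daily-android", "next_run_at": datetime(2016, 1, 1, 10, 0),
     "last_triggered_at": datetime(2016, 1, 1, 10, 0)},
    # Only 1 minute late -> within threshold, no alert yet.
    {"job_id": "hourly-win", "next_run_at": datetime(2016, 1, 1, 10, 4),
     "last_triggered_at": None},
]

find_late_jobs(rows, now)  # → ["daily-ios"]
```

In practice a check like this would run on a short interval against the scheduled builds table and feed whatever paging or alerting system is in place.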
We’ll keep you informed of updates.
Thanks,
Unity Cloud Build