On Android devices containing a Snapdragon 800 series SoC, our game crashes with “signal 11 (SIGSEGV), code 1 (SEGV_MAPERR)”. When Unity manages to log an exception instead, the stack trace is corrupt and shows calls to methods that cannot be called from that point. We are building with Unity 4.6.1p1.
Our game has been in production on iOS for 6 months, where it is very stable. The code where the crash occurs is shared between iOS and Android, so I do not believe it is an issue with our codebase (uninitialised variable etc.).
We have correlated the crash with the Snapdragon 800 series, but we have not established a causal link. The affected devices we have identified are:
- Google Nexus 5 (A 4.4.4) (A 5.0) (A 5.0.1)
- HTC One M8 (A 4.4.4)
- Sony Xperia Tablet Z2 (A 4.4.2)
- Sony Xperia Phone Z2 (A 4.4.4)
- Samsung Galaxy S5 (A 4.4.2)
Most Android devices work perfectly; unfortunately, those listed above are the flagship devices of several important manufacturers (and so can’t be blacklisted). All contain a Snapdragon 800 series chip. We have yet to see this issue on a device without one.
We have also tested on the Kindle Fire HDX, which contains a Snapdragon 800. It is the only device with the chip that has not failed; however, given Fire OS’s significant deviation from “standard” Android, we are treating it as a corner case.
Observations
- The crash does not occur on every run but, if it is going to crash, it always happens at the same place.
- The crash usually terminates the process with “signal 11 (SIGSEGV), code 1 (SEGV_MAPERR)”; however, sometimes the crash (at the same point) does not terminate the process, and Unity logs a NullReferenceException instead.
- If a NullReferenceException is logged, the stack trace shows code that cannot be run:
  - the first few lines of the trace are correct;
  - then a call is made to a method that is not referenced from the traced call site.
- The erroneous method crashes while accessing a C# auto-implemented property:
  - the NullReferenceException is thrown from within the property getter;
  - the getter contains only compiler-generated code;
  - it should never throw unless “this” has been nulled during the access;
  - (we have no multi-threaded code and no reflection).
- Switching from Dalvik to ART stops the error from occurring (no crash, perfect behaviour):
  - however, upgrading the Nexus 5 to Android 5.0 (and 5.0.1), which uses ART by default, makes the crash occur more frequently!
Given the NullReferenceException, it is tempting to conclude that an uninitialised instance is being dereferenced; however:
- the same code runs flawlessly on:
  - iOS
  - most Android devices
  - OSX
  - Windows and Windows Phone
  - the Unity Editor
- the exception’s stack trace should be impossible:
  - As unlikely as this may sound, both manual inspection and ReSharper’s static usage analysis show no path between the last legitimate method call and the next, erroneous one (where the exception occurs).
  - To prove that the stack trace is accurate and not itself corrupt, we inserted logging into the erroneous methods and confirmed that they are executed.
On a non-crashing run, the erroneous code is not executed; the correct path through the code is taken.
It would appear that the wrong code is being run! My hunch is that the Mono JIT is selecting the wrong IL to compile; however, apart from mentions of trampolines in the tombstones (suggesting first-time method access), I have little evidence for this.
On non-crashing Android devices there is a very short pause in execution when reaching the would-be crash site; I would normally attribute this to JIT compilation, as it is the first time that code executes. This timing correlates with the crash, but is the JIT to blame? And if so, why only on these Android devices?
Restarting the process on the device sometimes makes a game that crashed run flawlessly. Logging shows that the erroneous path through the code is not taken; the correct one is. We have not rebuilt in Unity; this is the same APK, just restarted. As far as I know, the Mono JIT is the only agent capable of changing the code’s static execution path.
We’ve taken this investigation as far as we can without support from Unity or the community. It would be very interesting to hear whether anyone else has experienced this problem. It’s similar to @zibizz1’s comment in this post; however, I’ve made the problem more specific here, focussing on the corrupt stack trace.