Snapdragon 800 crashes with corrupt stack trace

On Android devices containing a Snapdragon 800 series SoC, our game crashes with “signal 11 (SIGSEGV), code 1 (SEGV_MAPERR)”. If Unity is able to log an exception, the stack trace is corrupt, showing methods that cannot be called. We are building with Unity 4.6.1p1.

Our game has been in production on iOS for 6 months where it is very stable. The area of the code where the crash occurs is common to both iOS and Android so I do not believe it is an issue with our codebase (uninitialised variable etc.).

We have correlated the crash with the Snapdragon 800 series but have not established a cause. The affected devices we have identified are:

  • Google Nexus 5 (A 4.4.4) (A 5.0) (A 5.0.1)
  • HTC One M8 (A 4.4.4)
  • Sony Xperia Tablet Z2 (A 4.4.2)
  • Sony Xperia Phone Z2 (A 4.4.4)
  • Samsung Galaxy S5 (A 4.4.2)

Most Android devices work perfectly; unfortunately those listed above represent the flagship devices for several important manufacturers (and so can’t be blacklisted). All contain the Snapdragon 800 series chip. We are yet to see this issue on a device without said chip.

We have also tested on the Kindle Fire HDX, which contains a Snapdragon 800. This has been the only device with the chip not to fail; however, given Fire OS’s significant deviation from “standard” Android we are treating it as a corner case.

Observations

  • The crash does not occur on every run but, if it is going to crash, it will always happen at the same place
  • The crash usually terminates the process resulting in “signal 11 (SIGSEGV), code 1 (SEGV_MAPERR)”; however
      • sometimes the crash (at the same point) does not terminate the process and instead Unity logs NullReferenceException
  • If NullReferenceException is logged, the stack trace shows code that cannot be run
      • the first few lines of the trace are correct
      • then a call is made to a method that is not referenced from the traced callsite
  • The erroneous method crashes accessing a C# autogenerated property
      • the NullReferenceException is thrown from within the autogen’d property
      • the getter contains only compiler generated code
      • it should never error unless “this” has been null’d during the access
      • (we have no multi-threaded code and no reflection)
  • Switching from Dalvik to ART stops the error from occurring (no crash, perfect behaviour)
      • however, upgrading the Nexus 5 to Android 5.0 (and 5.0.1), which has ART enabled by default, causes the crash to occur more frequently!
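
The two failure modes listed above can be told apart mechanically when triaging logs from many runs. A minimal sketch in Python, assuming each run's log output has been saved as text; the function names and labels are hypothetical, and the marker strings are simply the ones quoted in this post, so real logs may need different patterns:

```python
from collections import Counter

# Marker strings copied from the crash descriptions in this post (assumed
# to appear verbatim in the captured log text).
SEGFAULT_MARKER = "signal 11 (SIGSEGV), code 1 (SEGV_MAPERR)"
NULLREF_MARKER = "NullReferenceException"


def classify_run(log_text):
    """Label one run's captured log as 'segfault', 'nullref' or 'clean'."""
    # A segfault terminates the process, so it takes priority if both appear.
    if SEGFAULT_MARKER in log_text:
        return "segfault"
    if NULLREF_MARKER in log_text:
        return "nullref"
    return "clean"


def tally_runs(logs):
    """Count outcomes across an iterable of per-run log texts."""
    return Counter(classify_run(text) for text in logs)
```

Tallying outcomes per device and OS version would make it easier to see whether a settings change shifts the ratio of segfaults to logged exceptions.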

Given the NullReferenceException it is tempting to conclude an uninitialised instance is being dereferenced; however:

  • the same code runs flawlessly on:
      • iOS
      • most Android devices
      • OSX
      • Windows and Windows Phone
      • Unity Editor
  • the exception’s stack trace should not occur.
      • As unlikely as this may sound, both manual inspection and Resharper’s static usage analysis show no path between the last legitimate method call and the next erroneous one (where the exception occurs).
      • To prove the stack trace logging is accurate and not corrupt itself, we have inserted logging into the erroneous methods and they are executed.

On a non-crashing run the erroneous code is not executed - the correct path through the code is taken.
It would appear that the wrong code is being run! My hunch is that the Mono JIT is selecting incorrect IL to compile; however, other than mentions of trampolines in the tombstones (suggesting initial method access), I have little evidence to prove this.

On non-crashing Android devices there is a very short pause in execution when reaching the would-be crash site; I normally attribute this to JITing as it’s the first occurrence of the code’s execution. This timing correlates with the crash but is JIT to blame? And if so, why only on these Android devices?

Restarting the process on device sometimes causes a crashing game to execute flawlessly. Logging shows the erroneous path through the code is not taken, the correct path is. We have not performed a rebuild in Unity - this is the same APK, just restarted. As far as I know, Mono JIT is the only agent capable of affecting the code’s static execution path.

We’ve taken this investigation as far as we can without support from Unity/the community. It would be very interesting to hear if anyone else has experienced this problem. It’s similar to @zibizz1's comment in this post; however, I’ve made the problem more specific here, focussing on the corrupt stack trace.

We are working on a fix for a bug where a NullReferenceException causes a segfault. That bug first appeared in Unity 4.6.0p1.

Thanks for the update!
Unfortunately we see the segfault in Unity 4.5.4f1.

Do you have any advice on how to further diagnose the problem?

Are you sure that it’s the same crash you see in Unity 4.5.4f1? We fixed two memory errors that are specific to Adreno devices since 4.5.4f1.
Best would be to file a bug report with repro case.

Hi Flo

I filed a bug report back in November (655263) but it’s had no response yet. I’ve been unable to distill the problem into a reproducible case - I’m working on it.

I’m having difficulty because the crash does not happen on every run and I’m unsure exactly what triggers it. The best case is it happens on the first run, the worst case was 49 runs between crashes.
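
Because the crash is intermittent, a handful of clean runs says little about whether a change fixed it. A rough sketch of the arithmetic in Python, assuming (and this is only an assumption) that each run crashes independently with a fixed probability p: the chance of n clean runs in a row is (1 - p)^n, so n must satisfy (1 - p)^n <= 1 - confidence. The function name is hypothetical:

```python
import math


def clean_runs_needed(crash_probability, confidence=0.95):
    """Consecutive clean runs needed before a bug with the given per-run
    crash probability would have shown up with the given confidence."""
    if not 0.0 < crash_probability < 1.0:
        raise ValueError("crash_probability must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # P(no crash in n runs) = (1 - p)^n; solve (1 - p)^n <= 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - crash_probability))
```

If the worst observed gap of 49 runs is read as roughly one crash in 50 runs, `clean_runs_needed(1 / 50)` returns 149, so a few dozen clean runs would not be convincing evidence of a fix.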

This morning I upgraded to 4.6.1p3 and the game crashed at the same point - with a NullReferenceException thrown from my code but with a corrupt stack trace, i.e. the method throwing the exception CANNOT be called from the call site.

This is the same behaviour we experienced back on 4.5.4f1 (I rebuilt last night with 4.5.4f1 to confirm).

The tombstone referenced mono trampolines suggesting one of the failing methods was being executed for the first time. My hunch is that something (Adreno OpenGL ES driver?) has corrupted memory and the JIT compiler selected the wrong method to compile. I can’t prove this but I’m trying. I included the tombstones with my bug report.

Any thoughts?

We have the same issue (Unity 4.6.1p4).
Devices: HTC One M8, LG G3.

We are getting a NullReferenceException with a corrupted stack. If we change the part of the code where the stack is corrupted, the problem moves to another place.

We had this before on most Samsung tablets, and we fixed it by using a different copy of the code repository. (We use hg and everyone should have the same files, but when a teammate used his own project to build the APK it stopped crashing.) Now we have it again in a different place in the code and on different devices, and we don’t know how to fix it.

@cgJames try different player settings (dynamic batching on/off, OpenGL 2.0/Automatic).
On Xperia Z2 phones and the OnePlus One, disabling dynamic batching and forcing OpenGL 2.0 in Player Settings gave us more stability.

And yes, switching from Dalvik to ART helps, but users just give us 1 star when the game crashes.

We tried changing properties to methods, but changing the part of the code with the error only moves the problem to another place.
It looks like something (I think Unity's C++ code) is writing to C# memory (the stack). Switching OpenGL to 3.0 (or Automatic) also solves this issue (but we have performance issues on the Nexus Player TV with these settings :( ).

@zibizz1 Unity QA have been able to reproduce the issue via my bug report (655263). It’s been forwarded to the dev team so hopefully we’ll have a resolution soon.

The bug now exists in the issue tracker. I’m sure getting this bug fixed will stop a lot of random crashes on flagship devices running KitKat. I’ve checked and the fix schedule is unknown, so please head over to the issue tracker and upvote it!

Any progress on this?
I’ve had a user report an app crash on startup on a Galaxy Note 4, which has a Snapdragon 805 and runs Android 4.4.4.

Sadly no progress. It still occurred in Unity 5 RC1. I’ve not checked the GM but issue tracker still shows the bug’s status as active.

We need more votes to get it fixed. If you have any weird crashes or unexplained behaviour on Android, this bug could be affecting you, so head over and upvote it!

Not sure if I’m reading the issue tracker page correctly, but it looks like the issue has been closed as not reproducible, though based on recent comments it still appears to be happening in Unity 4 and 5.

That’s weird. The fogbugz report that cgJames posted clearly has Unity QA saying they’ve reproduced it, but now that same bug on issuetracker has been closed as “not reproducible”…

Is there an explanation for this confusion? @florianpenzkofer ?

The case reported by cgJames (655263) was fixed a while ago.
The case got reopened internally by QA but that was actually a different crash, so it got closed again with a wrong status.