Creating particles using point sprites / billboards with Cg

Hi everyone

This is based on the post http://forum.unity3d.com/threads/158891-Unity4-point-sprite-texture-coordinates.

I just started with shader programming and I’m trying to create a small particle system using Cg. I want to use this to place the particles randomly.

Now I’m dealing with the same question as in the post above. The problem seems to be that every point gets just one color: the shader samples the texture at the coordinates of the point. When I create a mesh like in the post, with double the size of the texture (64x64 = 128), I get something like this:

[Image: pointSprite.png]

My Shader

Shader "Custom/SamplePointShader" {
	Properties {
		_MainTex ("Texture Image", 2D) = "" {}
		_SpriteSize("SpriteSize", Float) = 10
	}
	SubShader {	
		Pass{
		CGPROGRAM
		#pragma vertex vert
		#pragma fragment frag
		#include "UnityCG.cginc"		
		
		sampler2D _MainTex;
		fixed4 _TintColor;
		float _SpriteSize;
		
		struct vertexInput {
			float4 pos : POSITION;
			fixed4 col : COLOR;
			float2 UV : TEXCOORD0;
		};

		struct fragmentInput {
			float4 pos : SV_POSITION;
			float size : PSIZE;
			fixed4 col : COLOR;
			float2 UV: TEXCOORD0;
		};
		
              
              
		fragmentInput vert(vertexInput input) {
			fragmentInput output = (fragmentInput)0;

			output.pos = mul(UNITY_MATRIX_MVP, input.pos);
			output.col = input.col;
			output.size = _SpriteSize;
			output.UV = input.UV;
			return output;
		}

		float4 frag(fragmentInput input) : COLOR
		{
			return tex2D(_MainTex, float2(input.UV.x / _SpriteSize, input.UV.y / _SpriteSize));
		}
		ENDCG
	} 
	
	}
}

How is it possible to show the whole texture on one point? Is it possible?

I also tried to create particles based on the billboard technique (http://en.wikibooks.org/wiki/Cg_Programming/Unity/Billboards).

It really looks like a nice particle:

My Billboard Shader

Shader "Custom/SampleBillboard" {
	Properties {
		_TintColor ("Tint Color", Color) = (0.5,0.5,0.5,0.5)
		_MainTex ("Texture Image", 2D) = "" {}
		_SpriteSize("SpriteSize", Float) = 10
	}
	SubShader {	
		Pass{
			Cull Off
			Lighting Off
			ZWrite Off
			Blend SrcAlpha One
			AlphaTest Greater .01
			ColorMask RGB
			
		CGPROGRAM
		#pragma vertex vert
		#pragma fragment frag
		#include "UnityCG.cginc"		
		
		sampler2D _MainTex;
		fixed4 _TintColor;
		float _SpriteSize;
		
		struct vertexInput {
			float4 pos : POSITION;
			fixed4 col : COLOR;
			float2 UV : TEXCOORD0;
		};

		struct fragmentInput {
			float4 pos : SV_POSITION;
			fixed4 col : COLOR;
			float2 UV: TEXCOORD1;
		};
		
              
              
		fragmentInput vert(vertexInput input) {
			
			fragmentInput output = (fragmentInput)0;
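			// Transform the object origin to view space, then add the quad vertex offset
			// in view space so the quad always faces the camera.
			output.pos = mul(UNITY_MATRIX_P, mul(UNITY_MATRIX_MV, float4(0.0, 0.0, 0.0, 1.0)) + float4(input.pos.x, input.pos.y, 0.0, 0.0));
			output.col = input.col;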
			output.UV = input.UV;
			return output;			
		}
		
		float4 frag(fragmentInput input) : COLOR
		{			
			return 2.0f * input.col * _TintColor * tex2D(_MainTex, float2(input.UV));
		}
		ENDCG
	} 
	
	}
}

But now I have some basic problems. Objects that are behind the particle hide it.

Beyond this projection matrix problem, if I use the billboard solution I have to create an object for every particle and attach the shader to it. Is it possible to create those in the shader instead?

Or am I totally wrong, and do I have to use something like a geometry shader to achieve all this?

Sorry for the bunch of questions, and my bad English ;).

I’m now one step further and I’m trying to improve my shader.

The billboard shader needs multiple passes for correct alpha sorting. In this case, one pass with “Cull Back” and one pass with “Cull Front” is required. Accordingly, I set the tag “Queue”=“Transparent+100”.

But the side effect of the two-pass solution is that it breaks batching (http://docs.unity3d.com/Documentation/Manual/DrawCallBatching.html).

Is there a way to do this without needing two passes? Batching reduces the draw calls by 30% or more.

But it would be much better if there were a solution for the point-sprite shader.

I found some options in GLSL and HLSL, like “ARB_point_sprite” (GLSL) and “PointSpriteEnable” (HLSL), which should turn on full texture mapping on a point sprite.

I’m still searching for an option in Cg / ShaderLab… Does such an option exist?

Thanks in advance!!! :slight_smile:

I guess no one found a way to include UVs with point sprites. I too would very much like this functionality. Any word from the devs on whether this will be included at some point?

After spending hours on this, I’m getting closer to a solution.

I discarded the point sprite solution and rebuilt it with a compute shader and a geometry shader. The geometry shader creates the billboard; the compute shader computes the new position based on the Perlin noise algorithm.

Basically everything works fine. But I need to draw 1,000,000 particles, and a compute shader cannot handle more than 65,535 dimensions (x * y * z).

Now I’m trying to split those particles up into smaller packages, but my display driver crashes (NVIDIA GTX 560M). This happens at either 262,140 or 524,280 particles, depending on the ported Perlin noise algorithm. One is based on the libnoise algorithm and the other on a Java port from Ken Perlin.

Is there a bug or problem with a huge amount of calculation done by a compute shader? When I do my calculations on the CPU side, it works like a charm.

However, I need the performance advantage of the compute shader! I want to move these particles around in the Update function, and no one wants to wait 3+ seconds to see the next frame :).

A curious thing is that the code below works without a crash when I first call computeBuffer.GetData(). See the second code example.

The crash occurs in OnPostRender after I call Graphics.DrawProcedural(). There is no failure message in either the Windows event log or a log file.

That’s how I tried it:

		int dispatchsToRun = instanceCount / MaxDimensionToDispatch; // MaxDimensionToDispatch is set to 65535
		_materialsToDraw = new Material[dispatchsToRun];

		int j = 0;

		do {
			var positionsToRun = _pos.Skip(MaxDimensionToDispatch * j).Take(MaxDimensionToDispatch).ToArray();

			var mat = new Material(MaterialToUse);
			mat.hideFlags = HideFlags.HideAndDontSave;

			var computeBuffer = new ComputeBuffer (instanceCount, 12);
			computeBuffer.SetData(positionsToRun);

			AttractorShader.SetFloat("XScale", XScale);
			AttractorShader.SetFloat("YScale", YScale);
			AttractorShader.SetFloat("ZScale", ZScale);

			AttractorShader.SetBuffer(AttractorShader.FindKernel("CSMain"), "buf_Positions", computeBuffer);
			AttractorShader.Dispatch(AttractorShader.FindKernel("CSMain"), positionsToRun.Length, 1, 1);

			mat.SetBuffer("buf_Positions", computeBuffer);

			_materialsToDraw[j] = mat;

			j++;
		} while (j < dispatchsToRun);

This version works, but it is slow as hell! GetData takes 500 ms, which is slower than doing the work on the CPU.

		int dispatchsToRun = instanceCount / MaxDimensionToDispatch;

		int j = 0;

		var nPositions = new List<Vector3>();

		do {
			var positionsToRun = _pos.Skip(MaxDimensionToDispatch * j).Take(MaxDimensionToDispatch).ToArray();

			var computeBuffer = new ComputeBuffer (instanceCount, 12);
			computeBuffer.SetData(positionsToRun);

			AttractorShader.SetFloat("XScale", XScale);
			AttractorShader.SetFloat("YScale", YScale);
			AttractorShader.SetFloat("ZScale", ZScale);

			AttractorShader.SetBuffer(AttractorShader.FindKernel("CSMain"), "buf_Positions", computeBuffer);
			AttractorShader.Dispatch(AttractorShader.FindKernel("CSMain"), positionsToRun.Length, 1, 1);

			computeBuffer.GetData(positionsToRun);

			nPositions.AddRange(positionsToRun);

			j++;
		} while (j < dispatchsToRun);

		_bufferPos = new ComputeBuffer (instanceCount, 12);
		_bufferPos.SetData(nPositions.ToArray());

Simple post render call:

	void OnPostRender () {
		MaterialToUse.SetPass(0);
		Graphics.DrawProcedural (MeshTopology.Points, instanceCount);
	}
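
One detail in the chunking above is worth spelling out: `instanceCount / MaxDimensionToDispatch` is an integer division, so with 1,000,000 particles and a limit of 65,535 it yields 15 chunks and leaves the last 16,975 particles out. Here is a minimal sketch of a split that keeps the remainder, in plain C# without any Unity types. `ParticleChunks` and `Split` are hypothetical names for illustration, not part of the code above.

```csharp
using System;
using System.Collections.Generic;

public static class ParticleChunks
{
    // Splits the index range [0, total) into consecutive (Offset, Count) ranges,
    // each holding at most maxPerChunk items. The last range holds the remainder.
    public static List<(int Offset, int Count)> Split(int total, int maxPerChunk)
    {
        if (total < 0) throw new ArgumentOutOfRangeException(nameof(total));
        if (maxPerChunk <= 0) throw new ArgumentOutOfRangeException(nameof(maxPerChunk));

        var chunks = new List<(int Offset, int Count)>();
        for (int offset = 0; offset < total; offset += maxPerChunk)
        {
            // Clamp the final chunk so it never runs past the end.
            int count = Math.Min(maxPerChunk, total - offset);
            chunks.Add((offset, count));
        }
        return chunks;
    }
}
```

With this split, each chunk would only need a buffer of `Count` elements rather than `instanceCount`. Whether that also avoids the driver crash is untested.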

While a dispatch cannot handle more than 65,535 thread groups per dimension, it is important to realise that each group runs the number of threads set by numthreads in your compute shader. That means you can process far more than 65,535 particles.

E.g. if you have

shader.Dispatch(kernel, 4096, 1, 1);

and in your compute shader you have

[numthreads(1024,1,1)]

that will give you 4096 * 1024 = 4,194,304 particles in total.

Then you just have to use that total value when creating your buffers, to ensure you have the correct number of entries, and use

Graphics.DrawProcedural (MeshTopology.Points, 4194304, 1);

to draw them.

Obviously you wouldn’t hard-code these numbers. The easiest method is to decide on the number of threads in the compute shader, e.g. 1024. Then in your code have a property for desiredNumParticles and just divide this by 1024 to get the value for your Dispatch call. It’s important to make desiredNumParticles a multiple of 1024, though; otherwise the integer division drops the remainder and the last particles are never processed.
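
As a rough sketch of that calculation in plain C# (no Unity types): one option is to round the group count up instead of requiring an exact multiple, and to size the buffers to the padded total. `DispatchMath`, `GroupCount` and `PaddedCount` are hypothetical names, and the 65,535 limit is the one discussed above.

```csharp
using System;

public static class DispatchMath
{
    // Per-dimension thread group limit discussed in this thread.
    public const int MaxGroupsPerDimension = 65535;

    // Smallest number of thread groups such that
    // groups * threadsPerGroup >= particleCount (ceiling division).
    public static int GroupCount(int particleCount, int threadsPerGroup)
    {
        if (particleCount <= 0) throw new ArgumentOutOfRangeException(nameof(particleCount));
        if (threadsPerGroup <= 0) throw new ArgumentOutOfRangeException(nameof(threadsPerGroup));

        // Use long for the intermediate sum to avoid int overflow.
        long groups = ((long)particleCount + threadsPerGroup - 1) / threadsPerGroup;
        if (groups > MaxGroupsPerDimension)
            throw new ArgumentException("Too many thread groups for a single dispatch dimension.");
        return (int)groups;
    }

    // Total number of threads actually launched; buffers sized to this value
    // have one entry per thread, including the padding threads.
    public static int PaddedCount(int particleCount, int threadsPerGroup)
    {
        return GroupCount(particleCount, threadsPerGroup) * threadsPerGroup;
    }
}
```

For 1,000,000 particles and 1024 threads per group this gives 977 groups and a padded total of 1,000,448, so the kernel would have to skip (or harmlessly process) the 448 extra entries.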

Thank you for your explanation!

That was exactly what I was looking for! :smile:

I’m stuck on implementing a strange attractor with a compute shader. The problem is that every point is based on the calculated point before.

How do I store a calculated value for the next thread? How do I synchronize them? I have read some articles and posts on msdn about “globallycoherent”, “sync”, “MemoryBarriers” and “groupshared”. But I don’t get it. Everything I’ve tried has not worked.

http://msdn.microsoft.com/en-us/library/windows/desktop/hh447241%28v=vs.85%29.aspx

http://msdn.microsoft.com/en-us/library/windows/desktop/ff471475%28v=vs.85%29.aspx

http://msdn.microsoft.com/en-us/library/windows/desktop/ff471569%28v=vs.85%29.aspx

On the C# side I simply do this:

		Vector3 nPos = new Vector3(0f, 0f, 0f);

		for (int i = 0; i < instanceCount; i++)
		{
			nPos = calcPickover(nPos.x, nPos.y, nPos.z);
			_pos[i] = nPos;
		}

Now I want to do something similar in my compute shader. I didn’t find any example of global synchronization.

Another kick in the right direction would be great!

Thanks! :slight_smile:

Take a look at the DX11 particle system I made; it’s completely free for the community to use. I think I actually based my compute shader on your original shader at the top of the thread (mixed in with a few other sources).

It’s a bit buggy but working, and I’m actually using it in a professional project.

LINK: https://dl.dropboxusercontent.com/u/8353693/New%20systems/Final/DX11-Community%20Particle%20System_v1-0.unitypackage

It’s also in the showcase thread under “DX11 Particle system Free”.

Thank you for your example bajeo :smile:!

Correct me if I’m wrong: In your system each particle is calculated independently and is just affected by input values.

However a particle in a strange attractor system needs to know the position of the previous particle. That’s why I try to get the latest particle position in my compute shader.

I created a small ping-pong shader where I switch the compute buffers. As you can imagine, it’s really slow (51,200 = 0.124ms, 102,400 = 3.9s). For comparison, my CPU computes 1,000,000 positions within 3.4 seconds!

What I want to achieve is to port the calculation from the CPU to the GPU with a single run (from the perspective of the CPU). I’m looking for a solution where I can compute the position and then store it in a globally synchronized variable.

Is that even possible?

It’s working. I don’t know why I was so focused on threads and their synchronization.

I now create an array with one entry per particle instance and pass it to my compute shader, which I dispatch with dimensions 1 x 1 x 1 and numthreads of 1 x 1 x 1. In the compute shader I do the same as in the C# code: a for loop runs over the buffer, computes the new position and saves it to a variable declared outside the loop.

After this stage I chain the result into my next compute shader, which calculates the Perlin noise and scales the particles up.

I expected the for loop in my first stage to be too slow, but it’s damn fast! The particle field (1,024,000 points) appears within 00.007ms!
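
To make the dependency explicit, here is a minimal sketch of such a sequential fill in plain C# (System.Numerics instead of Unity types). The thread never shows calcPickover, so the map below (one commonly cited form of the Pickover attractor) and its constants are assumptions, as are the names `AttractorFill`, `Step` and `Fill`. A kernel dispatched as 1 x 1 x 1 with numthreads(1,1,1) would run the same kind of loop on a single GPU thread.

```csharp
using System;
using System.Numerics;

public static class AttractorFill
{
    // Assumed constants; not taken from the thread.
    const float A = 2.24f, B = 0.43f, C = -0.65f, D = -2.43f, E = 1.0f;

    // One iteration of the assumed Pickover map: the next point depends
    // only on the previous point.
    public static Vector3 Step(Vector3 p)
    {
        return new Vector3(
            MathF.Sin(A * p.Y) - p.Z * MathF.Cos(B * p.X),
            p.Z * MathF.Sin(C * p.X) - MathF.Cos(D * p.Y),
            E * MathF.Sin(p.X));
    }

    // Fills an array by iterating the map from the origin. Entry i needs
    // entry i - 1, so the loop cannot be split across independent threads.
    public static Vector3[] Fill(int count)
    {
        var positions = new Vector3[count];
        var current = Vector3.Zero; // carried across iterations, outside the loop
        for (int i = 0; i < count; i++)
        {
            current = Step(current);
            positions[i] = current;
        }
        return positions;
    }
}
```

The state variable `current` plays the role of the variable kept outside the for loop in the compute shader.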

Hey Nightking,

Glad you got it working; I hope my example helped in some way. Mine was mainly focused on making basic particle systems, but that could easily be extended. For example, with about 50 lines of C# code I turned it into a full volumetric cloud system, all using Perlin noise.

I was quite tempted to look at adding attractors and repulsors when I was working on it, but ran out of time. It would be cool to see a result and how it was achieved!