Tuesday, June 24, 2014

GLSL Classes


GLSL is not object oriented: it has no classes or method calls. GLSL 4.0 has subroutine variables that can provide an object-oriented solution, but they are not available in the GLSL supported by WebGL.

PythonJS GPU Class

In the example below an array of MyObject is created and uploaded to the GPU, where it is iterated over. The method call s.mymethod(1.1, 2.2) translates to MyObject_mymethod(s, 1.1, 2.2) in the GLSL shader.

@gpu.object
class MyObject:
 @gpu.method
 float def subroutine(self, x,y):
  float x
  float y
  return x + y * self.attr2

 @gpu.method
 float def mymethod(self, x,y):
  float x
  float y
  if self.index == 0:
   return -20.5
  elif self.index == 1:
   return 0.6
  else:
   return self.subroutine(x,y) * self.attr1

 def __init__(self, a, b, i):
  self.attr1 = a
  self.attr2 = b
  self.index = int16(i)


class myclass:
 def run(self, w):
  self.array = [MyObject(1.1,1.2,x) for x in range(w)]

  @returns( array=64 )
  @gpu.main
  def gpufunc():
   struct* A = self.array
   float b = 0.0

   for s in iter(A):
    b += s.mymethod(1.1, 2.2)

   return b

  return gpufunc()
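
The method-to-function rewrite described above can be sketched in plain Python. This is only an illustration of the naming scheme, not the PythonJS translator itself; flatten_method_call is a hypothetical helper invented for this example.

```python
def flatten_method_call(class_name, receiver, method, args):
    """Build the flat GLSL-style call for `receiver.method(*args)`.

    The receiver becomes the first argument, and the function name is
    the class name joined to the method name with an underscore.
    """
    call_args = [receiver] + list(args)
    return "%s_%s(%s)" % (class_name, method, ", ".join(call_args))


# s.mymethod(1.1, 2.2) on a MyObject instance
print(flatten_method_call("MyObject", "s", "mymethod", ["1.1", "2.2"]))
```

Because the struct is passed explicitly as the first argument, the shader needs no notion of classes at all.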

Sunday, June 22, 2014

GLSL Array of Arrays


WebGL shader code is written in GLSL based on the OpenGL ES standard; see the GLSL 1.2 spec (PDF). GLSL 1.2 has no support for multidimensional arrays (arrays of arrays), and loops require a constant expression for their iteration bounds. These limitations make it very hard to write generic shader programs.

PythonJS shader translation provides a workaround: it supports one-level-deep arrays of arrays and iteration over them. The input array data and sizes can change at runtime because the shader is fully recompiled on each call to its wrapper function. Attributes from the current scope in JavaScript can also be inlined into the shader. Read the syntax documentation here.

array of array example

class myclass:
 def __init__(self, s):
  self.s = s
 def my_method(self):
  return self.s

 def run(self, w, h):
  self.array = [ [x*y*0.5 for y in range(h)] for x in range(w) ]

  @returns( array=64 )
  @gpu.main
  def gpufunc():
   float* A = self.array
   float b = self.my_method()

   for subarray in A:
    for j in range( len(self.array[0]) ):
     b += subarray[j]
   return b

  return gpufunc()

GLSL output

The inner function gpufunc becomes main below. Inside gpufunc above, the assignment to A as a float pointer, float* A = self.array, triggers the wrapper code to unroll A into A_𝑛 and inline the values, where 𝑛 is the index of each subarray.

 void main() {
 float A_0[4];
 A_0[0]=0.0;A_0[1]=0.0;A_0[2]=0.0;A_0[3]=0.0;float A_1[4];A_1[0]=0.0;A_1[1]=0.5;A_1[2]=1.0;A_1[3]=1.5;float A_2[4];A_2[0]=0.0;A_2[1]=1.0;A_2[2]=2.0;A_2[3]=3.0;float A_3[4];A_3[0]=0.0;A_3[1]=1.5;A_3[2]=3.0;A_3[3]=4.5;float A_4[4];A_4[0]=0.0;A_4[1]=2.0;A_4[2]=4.0;A_4[3]=6.0;float A_5[4];A_5[0]=0.0;A_5[1]=2.5;A_5[2]=5.0;A_5[3]=7.5;float A_6[4];A_6[0]=0.0;A_6[1]=3.0;A_6[2]=6.0;A_6[3]=9.0;float A_7[4];A_7[0]=0.0;A_7[1]=3.5;A_7[2]=7.0;A_7[3]=10.5;
 ...
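
A rough Python sketch of this unrolling step is shown here. It is not the PythonJS wrapper code; unroll_float_arrays is a hypothetical name, and the sketch only reproduces the shape of the generated declarations.

```python
def unroll_float_arrays(name, rows):
    """Return GLSL statements declaring one fixed-size float array per row."""
    statements = []
    for n, row in enumerate(rows):
        # One array variable per subarray: A_0, A_1, ...
        statements.append("float %s_%d[%d];" % (name, n, len(row)))
        for j, value in enumerate(row):
            # Inline the runtime value as a float literal.
            statements.append("%s_%d[%d]=%r;" % (name, n, j, float(value)))
    return "".join(statements)


rows = [[x * y * 0.5 for y in range(4)] for x in range(2)]
print(unroll_float_arrays("A", rows))
```

Since the values are baked into the shader source, any change to the input data means generating and compiling the shader again.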

The wrapper inlines the runtime value of float b = self.my_method(). Iteration over the list, for subarray in A:, is translated into a for loop that copies the data from A_𝑛 into the iterator target subarray.

 float b;
 b = 0.1;
 for (int _iter=0; _iter < 4; _iter++) {
  float subarray[4];
  if (_iter==0) { for (int _J=0; _J<4; _J++) {subarray[_J] = A_0[_J];} }
  if (_iter==1) { for (int _J=0; _J<4; _J++) {subarray[_J] = A_1[_J];} }
  if (_iter==2) { for (int _J=0; _J<4; _J++) {subarray[_J] = A_2[_J];} }
  if (_iter==3) { for (int _J=0; _J<4; _J++) {subarray[_J] = A_3[_J];} }
  for (int j=0; j < 4; j++) {
   b += subarray[j];
  }
 }
 out_float = b;
 }
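
The loop translation can be sketched the same way. The function below is a hypothetical generator, not the real translator, that emits the chain of if tests used to pick the unrolled array for each pass.

```python
def emit_array_iteration(target, name, count, size, body):
    """Return GLSL lines that visit `count` unrolled arrays of `size` floats.

    The unrolled arrays are separate variables, so they cannot be selected
    by index. Each pass copies the matching one into `target` instead.
    """
    lines = ["for (int _iter=0; _iter < %d; _iter++) {" % count]
    lines.append(" float %s[%d];" % (target, size))
    for n in range(count):
        lines.append(
            " if (_iter==%d) { for (int _J=0; _J<%d; _J++) {%s[_J] = %s_%d[_J];} }"
            % (n, size, target, name, n)
        )
    # The translated loop body follows the copy step.
    lines.extend(" " + line for line in body)
    lines.append("}")
    return lines


for line in emit_array_iteration("subarray", "A", 2, 4, ["b += subarray[0];"]):
    print(line)
```

Both loop bounds are written out as literals, which is what satisfies the constant-expression rule for GLSL loops.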

array of structs example

To iterate over an array of structs, wrap the struct* with iter in a for loop. The struct below contains a float number 'num' and an array of floats 'arr'. The nested loop iterates over the indices of the struct's array 'arr'.

class myclass:
 def new_struct(self, g):
  return {
   'num' : g,
   'arr' : [0.1 for s in range(6)]
  }

 def run(self, w):
  self.array = [ self.new_struct( x ) for x in range(w) ]
  @returns( array=64 )
  @gpu.main
  def gpufunc():
   struct* A = self.array
   float b = 0.0
   for s in iter(A):
    b += s.num
    for i in range(len(s.arr)):
     b += s.arr[i]
   return b
  return gpufunc()

GLSL output

The assignment struct* A = self.array triggers the wrapper code to generate a struct typedef that is inserted into the shader header. self.array is inlined at runtime as 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚 A_𝑛. Before each struct is constructed, its array attribute arr is assigned to the variable _arrA_𝑛. The for loop switches the iterator target s based on the loop index 𝑛.

void main( ) {
 float b;
 b=0.0;

 float _arrA_0[6];_arrA_0[0]=0.1;_arrA_0[1]=0.1;_arrA_0[2]=0.1;_arrA_0[3]=0.1;_arrA_0[4]=0.1;_arrA_0[5]=0.1;
 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚 A_0 = 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚(0.0,_arrA_0);
 ...
 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚 A_6 = 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚(6.0,_arrA_6);
 float _arrA_7[6];_arrA_7[0]=0.1;_arrA_7[1]=0.1;_arrA_7[2]=0.1;_arrA_7[3]=0.1;_arrA_7[4]=0.1;_arrA_7[5]=0.1;
 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚 A_7 = 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚(7.0,_arrA_7);

 for (int _iter=0; _iter < 8; _iter++) {
  𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚 s;
  if (_iter==0) { s=A_0;}
  if (_iter==1) { s=A_1;}
  if (_iter==2) { s=A_2;}
  if (_iter==3) { s=A_3;}
  if (_iter==4) { s=A_4;}
  if (_iter==5) { s=A_5;}
  if (_iter==6) { s=A_6;}
  if (_iter==7) { s=A_7;}
  b += s.num;
  for (int i=0; i < 6; i++) {
   b += s.arr[i];
  }
 }
 out_float = b;
}
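
As a sketch of the struct inlining step, the Python below builds the same kind of statements from a list of dicts. It assumes every dict has the 'num' and 'arr' keys from the example; inline_structs and the plain-ASCII StructName are placeholders, and the generated typedef itself is not reproduced.

```python
def inline_structs(name, struct_name, items):
    """Return GLSL statements that construct one struct per dict in `items`.

    Each dict is assumed to look like {'num': float, 'arr': [floats]}.
    `struct_name` stands in for the generated typedef name.
    """
    statements = []
    for n, item in enumerate(items):
        arr_name = "_arr%s_%d" % (name, n)
        # The array attribute is declared first, as its own variable.
        statements.append("float %s[%d];" % (arr_name, len(item["arr"])))
        for j, value in enumerate(item["arr"]):
            statements.append("%s[%d]=%r;" % (arr_name, j, float(value)))
        # Then the struct is constructed from the number and that array.
        statements.append(
            "%s %s_%d = %s(%r,%s);"
            % (struct_name, name, n, struct_name, float(item["num"]), arr_name)
        )
    return statements


items = [{"num": x, "arr": [0.1 for s in range(6)]} for x in range(2)]
print("\n".join(inline_structs("A", "StructName", items)))
```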

Wednesday, June 18, 2014

PythonJS GPU Mandelbrot


PythonJS now supports translation of a limited subset of Python syntax into GPU code. This is done with a new GLSL backend using the WebCLGL library by Roberto Gonzalez. The above benchmark calculates the Mandelbrot set at a size of 512x512: CPython3 with NumPy takes over eight seconds to complete, while the PythonJS specialized version takes just 0.2 seconds.

source code: mandelbrot.py

40X faster is probably not the best that WebCLGL can do, since this benchmark was performed with a low-end GPU. To run the benchmarks yourself, install Python2 and Python3, extract the latest PyPy to your home directory, and install NumPy for each. Install the NodeWebkit NPM package. Then run these commands, which will save the file /tmp/mandelbrot.py.eps:

cd
git clone https://github.com/3DRoberto/webclgl.git
git clone https://github.com/PythonJS/PythonJS.git
cd PythonJS/regtests
./run ./bench/mandelbrot.py
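
For readers unfamiliar with the workload, here is a minimal pure-Python escape-time sketch of the Mandelbrot calculation. It is not the mandelbrot.py from the repository; the function names, the sampled region, and the iteration limit are all assumptions made for illustration.

```python
def escape_time(c, max_iterations=100):
    """Return how many iterations z = z*z + c takes to leave radius 2."""
    z = 0j
    for i in range(max_iterations):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    # Points that never escape are treated as inside the set.
    return max_iterations


def mandelbrot(width, height, max_iterations=100):
    """Sample the region [-2, 1] x [-1.5, 1.5] on a width x height grid."""
    rows = []
    for py in range(height):
        y = -1.5 + 3.0 * py / height
        rows.append([
            escape_time(complex(-2.0 + 3.0 * px / width, y), max_iterations)
            for px in range(width)
        ])
    return rows


grid = mandelbrot(16, 16, 50)
print(len(grid), len(grid[0]))
```

Every pixel is computed independently of its neighbours, which is why this kind of loop maps well onto a GPU.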

Friday, June 13, 2014

PythonJS SIMD Vectors part2


The Dart backend has a new class float32vec that encapsulates a list of Float32x4 SIMD vectors. The class acts like a normal vector, allowing you to do element-wise operations and lookups, while under the hood it indexes and operates on the appropriate sub-vector. This allows you to write SIMD-accelerated code without manually breaking things apart into chunks of four. Looping over the sub-vectors and the encapsulation add some overhead, which reduces performance for arrays with more than 32 elements. The micro benchmark above was performed with 32 elements and shows better performance than CPython with NumPy.

benchmark source code


float32vec

class float32vec:
 def __init__(self, items):
  self[...] = new( List() )
  self.length = items.length

  i = 0
  while i < items.length:
   x = items[i]
   y = items[i+1]
   z = items[i+2]
   w = items[i+3]
   vec = new( Float32x4(x,y,z,w) )
   self[...].add( vec )
   i += 4


 def __getitem__(self, index):
  if index < 0: index = self.length + index

  float32x4 vec = self[...][ index // 4 ]
  lane = index % 4
  if lane == 0: return vec.x
  elif lane == 1: return vec.y
  elif lane == 2: return vec.z
  elif lane == 3: return vec.w

 def __setitem__(self, index, value):
  if index < 0: index = self.length + index

  vec = self[...][ index // 4 ]
  lane = index % 4
  if lane == 0: vec = vec.withX(value)
  elif lane == 1: vec = vec.withY(value)
  elif lane == 2: vec = vec.withZ(value)
  elif lane == 3: vec = vec.withW(value)

  self[...][ index // 4 ] = vec

 def __add__(self, other):
  arr = new( List() )
  for i, vec1 in enumerate( self[...] ):
   vec2 = other[...][ i ]
   arr.add( vec1+vec2 )

  v = inline("new float32vec([])")
  v.length = self.length
  v[...] = arr
  return v

 def __mul__(self, other):
  arr = new( List() )
  for i, vec1 in enumerate( self[...] ):
   vec2 = other[...][ i ]
   arr.add( vec1*vec2 )

  v = inline("new float32vec([])")
  v.length = self.length
  v[...] = arr
  return v
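
The index arithmetic used by float32vec (chunk = index // 4, lane = index % 4) can be tried out in ordinary Python. The class below is a stand-in that uses plain lists instead of Float32x4, so it shows the bookkeeping only and gains no SIMD speedup; ChunkedVector and its zero padding are assumptions of this sketch.

```python
class ChunkedVector:
    """Pure-Python stand-in for a list of four-lane vectors."""

    LANES = 4

    def __init__(self, items):
        items = list(items)
        self.length = len(items)
        # Pad with zeros so the last chunk always has four lanes.
        # (Padding is a choice made for this sketch.)
        items += [0.0] * (-len(items) % self.LANES)
        self.chunks = [
            items[i:i + self.LANES] for i in range(0, len(items), self.LANES)
        ]

    def _locate(self, index):
        # Map a flat index to (chunk number, lane number).
        if index < 0:
            index += self.length
        if not 0 <= index < self.length:
            raise IndexError(index)
        return divmod(index, self.LANES)

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        chunk, lane = self._locate(index)
        return self.chunks[chunk][lane]

    def __setitem__(self, index, value):
        chunk, lane = self._locate(index)
        self.chunks[chunk][lane] = value

    def __mul__(self, other):
        # Multiply chunk by chunk, the step a real SIMD type does at once.
        result = ChunkedVector([])
        result.length = self.length
        result.chunks = [
            [a * b for a, b in zip(c1, c2)]
            for c1, c2 in zip(self.chunks, other.chunks)
        ]
        return result


v = ChunkedVector([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(v[5], v[-1], (v * v)[1])
```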


PythonJS SIMD Vectors


PythonJS using direct SIMD via the Dart backend and running in the Dart VM is about 6X faster than CPython with NumPy in the following micro benchmark, which tests float32x4 multiplication. SIMD stands for single instruction, multiple data; it allows you to instruct the CPU to perform the same math operation on a vector of data to increase performance. Read more about SIMD on my old research blog, here.

I expected NumPy to specialize the case of an array with four float32 elements to use SIMD. Searching for why it does not, I could not find any clear answers: [1], [2]. More confused and curious, I jumped into the PyPy IRC chat room, and Matti Picus gave me the answer: NumPy has no direct support for SIMD; instead it relies on helper libraries like MKL, BLAS, and LAPACK.

The DartVM includes SIMD and the float32x4 and int32x4 primitives as part of the core language; you simply import dart:typed_data. Google Chrome and Firefox are also in the process of supporting SIMD.

SIMD multiply micro benchmark

The PythonJS translator has been updated to pass type information to the Dart backend, and translate numpy.array into a Float32x4 vector if it has been typed as float32x4. See my previous blog post about optional static typing, here.

def main():
 start = time()
 float32x4 a = numpy.array( 
  [1.0001, 1.0002, 1.0003, 1.0004], 
  dtype=numpy.float32 )
 float32x4 b = numpy.array( 
  [1.00009, 1.00008, 1.00007, 1.00006], 
  dtype=numpy.float32 )
 float32x4 c = numpy.array( 
  [1.00005, 1.00004, 1.00003, 1.00002], 
  dtype=numpy.float32 )

 arr = []
 for i in range(20000):
  c *= a*b
  arr.append( a*b*c )

 print(time()-start)

Wednesday, June 11, 2014

PythonJS faster than CPython part2


My earlier post PythonJS now faster than CPython struck a nerve with the Python community and sparked some heated debate on Hacker News. The fact is that CPython is embarrassingly slow compared to V8, and the adoption rate of Python3 is very slow compared to JavaScript's explosive growth.

As pointed out by a few people in the Hacker News thread, my first benchmarks were performed with PyPy 1.9, so I have updated the test runner to use both the latest PyPy 2.3.1 and PyPy 1.9. I have also removed the micro-benchmarks, and now only include benchmarks from the Unladen Swallow test suite.

PythonJS (with the fast-backend) is faster than CPython in four of the following five benchmarks. Some may argue that this is not a fair comparison because the fast-backend supports only a subset of the Python language standard. Keep in mind that: 1. it is a large and useful subset, and 2. PythonJS is designed to allow you to mix both modes within the same script by blocking out code as fully compliant or fast using special with blocks or calling pythonjs.configure.

nbody

pystone

richards

fannkuch

float

benchmark source code


Python has some big problems beyond speed, such as running on mobile devices. Translation to JavaScript, and to other targets like C++ and Rust, makes it possible to sidestep the Python interpreter, and that makes deployment much simpler.

Saturday, June 7, 2014

64bit integer long type


PythonJS now has a new static type long that can be used to mark a variable as a 64bit integer; see this commit. JavaScript has no native support for 64bit integers, so the translator uses the Long.js API to construct a Long object and call the appropriate methods for math and comparison logic.

python input

def main():
 long x = 65536
 long y = x * x
 long z = 4294967296
 TestError( y==z )

 long a = z + z
 long b = 8589934592
 TestError( a==b )

 TestError( y < b )
 TestError( b > y )

 TestError( y <= b )
 TestError( b >= y )

javascript output

main = function() {
  if (__NODEJS__==true) var long = require('long');
  
  x = long.fromString("65536");
  y = x.multiply(x);
  z = long.fromString("4294967296");
  TestError( y.equals(z) );
  a = z.add(z);
  b = long.fromString("8589934592");
  TestError( a.equals(b) );
  TestError( y.lessThan(b) );
  TestError( b.greaterThan(y) );
  TestError( y.lessThanOrEqual(b) );
  TestError( b.greaterThanOrEqual(y) );
}
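
To see why a helper library is needed, note that JavaScript numbers are 64-bit floats, which hold integers exactly only up to 2**53. The Python sketch below first shows that limit using a float, then shows the general idea of carrying a 64-bit value as two 32-bit halves. It is an unsigned simplification written for this post's examples and does not claim to mirror the internals of Long.js; all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

# A 64-bit float cannot tell 2**53 and 2**53 + 1 apart.
print(float(2 ** 53) + 1.0 == float(2 ** 53))


def split(value):
    """Split a non-negative integer below 2**64 into (high, low) halves."""
    return (value >> 32) & MASK32, value & MASK32


def join(pair):
    """Rebuild the integer from its (high, low) halves."""
    return (pair[0] << 32) | pair[1]


def add64(a, b):
    """Add two (high, low) pairs with wrap-around at 2**64."""
    low = a[1] + b[1]
    carry = low >> 32
    return (a[0] + b[0] + carry) & MASK32, low & MASK32


def mul64(a, b):
    """Multiply two (high, low) pairs, keeping the low 64 bits.

    Python integers are unbounded, so the 32x32-bit product below is exact.
    In JavaScript the pieces would have to be smaller to stay exact.
    """
    low_product = a[1] * b[1]
    cross = (a[0] * b[1] + a[1] * b[0]) & MASK32
    high = ((low_product >> 32) + cross) & MASK32
    return high, low_product & MASK32


x = split(65536)
y = mul64(x, x)
print(join(y) == 4294967296)
print(join(add64(y, y)) == 8589934592)
```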