Tuesday, June 24, 2014

GLSL Classes


GLSL is not object oriented: it has no classes or method calls. GLSL 4.0 has subroutine variables that can provide an object-oriented solution, but they are not available in the GLSL used by WebGL.

PythonJS GPU Class

In the example below an array of MyObject is created and uploaded to the GPU, where it is iterated over. The method call s.mymethod(1.1, 2.2) translates to MyObject_mymethod(s, 1.1, 2.2) in the GLSL shader.

@gpu.object
class MyObject:
 @gpu.method
 float def subroutine(self, x,y):
  float x
  float y
  return x + y * self.attr2

 @gpu.method
 float def mymethod(self, x,y):
  float x
  float y
  if self.index == 0:
   return -20.5
   elif self.index == 1:
   return 0.6
  else:
   return self.subroutine(x,y) * self.attr1

 def __init__(self, a, b, i):
  self.attr1 = a
  self.attr2 = b
  self.index = int16(i)


class myclass:
 def run(self, w):
  self.array = [MyObject(1.1,1.2,x) for x in range(w)]

  @returns( array=64 )
  @gpu.main
  def gpufunc():
   struct* A = self.array
   float b = 0.0

   for s in iter(A):
    b += s.mymethod(1.1, 2.2)

   return b

  return gpufunc()
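
The rewrite of a method call into a plain function call, as described above, can be sketched in ordinary Python. This is only an illustration of the naming scheme (Class_method with the object passed as the first argument); the helper name flatten_method_call is hypothetical and is not part of the PythonJS translator, which works on the parsed source rather than on strings.

```python
import re

# Hypothetical helper, for illustration only: rewrites "obj.method(args)"
# into "Class_method(obj, args)", the flat form a GLSL shader can express.
def flatten_method_call(expr, class_name):
    match = re.fullmatch(r"(\w+)\.(\w+)\((.*)\)", expr.strip())
    if match is None:
        raise ValueError("not a simple method call: %r" % expr)
    obj, method, args = match.groups()
    args = args.strip()
    # The object becomes the first positional argument.
    call_args = obj if not args else obj + ", " + args
    return "%s_%s(%s)" % (class_name, method, call_args)

print(flatten_method_call("s.mymethod(1.1, 2.2)", "MyObject"))
```

Because GLSL has no dispatch mechanism, the class name has to be known at translation time, which is why the array is declared with an explicit struct* type.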

Sunday, June 22, 2014

GLSL Array of Arrays


WebGL shader code is written in GLSL as defined by the OpenGL-ES standard, see the GLSL 1.2 spec pdf. GLSL 1.2 has no support for multidimensional arrays (arrays of arrays), and loops require a constant expression for their iteration bound. These limitations make it very hard to write generic shader programs.

PythonJS shader translation provides a workaround: it supports one-level-deep arrays of arrays and iteration over them. The input array data and sizes can change at runtime because the shader is fully recompiled on each call to its wrapper function. Attributes from the current scope in JavaScript can also be inlined into the shader. Read the syntax documentation here.

array of array example

class myclass:
 def __init__(self, s):
  self.s = s
 def my_method(self):
  return self.s

 def run(self, w, h):
  self.array = [ [x*y*0.5 for y in range(h)] for x in range(w) ]

  @returns( array=64 )
  @gpu.main
  def gpufunc():
   float* A = self.array
   float b = self.my_method()

   for subarray in A:
    for j in range( len(self.array[0]) ):
     b += subarray[j]
   return b

  return gpufunc()

GLSL output

The inner function gpufunc becomes main below. Inside gpufunc above, the assignment to A as a float pointer, float* A = self.array, triggers the wrapper code to unroll A into one array A_𝑛 per sub-array and inline the values, where 𝑛 is the index of the sub-array.

 void main() {
 float A_0[4];
 A_0[0]=0.0;A_0[1]=0.0;A_0[2]=0.0;A_0[3]=0.0;
 float A_1[4];A_1[0]=0.0;A_1[1]=0.5;A_1[2]=1.0;A_1[3]=1.5;
 float A_2[4];A_2[0]=0.0;A_2[1]=1.0;A_2[2]=2.0;A_2[3]=3.0;
 float A_3[4];A_3[0]=0.0;A_3[1]=1.5;A_3[2]=3.0;A_3[3]=4.5;
 float A_4[4];A_4[0]=0.0;A_4[1]=2.0;A_4[2]=4.0;A_4[3]=6.0;
 float A_5[4];A_5[0]=0.0;A_5[1]=2.5;A_5[2]=5.0;A_5[3]=7.5;
 float A_6[4];A_6[0]=0.0;A_6[1]=3.0;A_6[2]=6.0;A_6[3]=9.0;
 float A_7[4];A_7[0]=0.0;A_7[1]=3.5;A_7[2]=7.0;A_7[3]=10.5;
 ...

The wrapper inlines the runtime value of float b = self.my_method(). The iteration over the list, for subarray in A:, is translated into a for loop that copies the data from A_𝑛 into the iterator target subarray.

    float b;
 b = 0.1;
 for (int _iter=0; _iter < 4; _iter++) {
  float subarray[4];
  if (_iter==0) { for (int _J=0; _J<4; _J++) {subarray[_J] = A_0[_J];} }
  if (_iter==1) { for (int _J=0; _J<4; _J++) {subarray[_J] = A_1[_J];} }
  if (_iter==2) { for (int _J=0; _J<4; _J++) {subarray[_J] = A_2[_J];} }
  if (_iter==3) { for (int _J=0; _J<4; _J++) {subarray[_J] = A_3[_J];} }
  for (int j=0; j < 4; j++) {
   b += subarray[j];
  }
 }
    out_float = b;
 }
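
The unrolling shown above can be approximated with a small code generator. This is a minimal sketch, assuming the wrapper simply emits one fixed-size GLSL array per row and one guarded copy per row inside the loop; the function names unroll_array_of_arrays and emit_row_dispatch are hypothetical and do not come from PythonJS.

```python
def unroll_array_of_arrays(name, rows):
    """Emit one fixed-size GLSL float array per row, with the values inlined."""
    lines = []
    for n, row in enumerate(rows):
        lines.append("float %s_%d[%d];" % (name, n, len(row)))
        for i, value in enumerate(row):
            lines.append("%s_%d[%d]=%s;" % (name, n, i, float(value)))
    return lines


def emit_row_dispatch(name, target, count, length):
    """Emit the per-iteration copy from A_n into the loop target."""
    lines = ["float %s[%d];" % (target, length)]
    for n in range(count):
        lines.append(
            "if (_iter==%d) { for (int _J=0; _J<%d; _J++) {%s[_J] = %s_%d[_J];} }"
            % (n, length, target, name, n)
        )
    return lines


# Same data shape as the example above, with w=2 and h=4.
data = [[x * y * 0.5 for y in range(4)] for x in range(2)]
print("\n".join(unroll_array_of_arrays("A", data)))
print("\n".join(emit_row_dispatch("A", "subarray", len(data), 4)))
```

The chain of if statements is needed because the shader cannot index into a set of separately named arrays with a loop variable; each candidate has to be spelled out.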

array of structs example

To iterate over an array of structs, wrap the struct* with iter in a for loop. The struct below contains a float number 'num' and an array of floats 'arr'. The nested loop iterates over the indices of the struct's array 'arr'.

class myclass:
 def new_struct(self, g):
  return {
   'num' : g,
   'arr' : [0.1 for s in range(6)]
  }

 def run(self, w):
  self.array = [ self.new_struct( x ) for x in range(w) ]
  @returns( array=64 )
  @gpu.main
  def gpufunc():
   struct* A = self.array
   float b = 0.0
   for s in iter(A):
    b += s.num
    for i in range(len(s.arr)):
     b += s.arr[i]
   return b
  return gpufunc()

GLSL output

The assignment struct* A = self.array triggers the wrapper code to generate a struct typedef that is inserted into the shader header. self.array is inlined at runtime as 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚 A_𝑛. Before the struct is constructed, the array attribute arr is assigned to the variable _arrA_𝑛. The for loop switches the iterator target s based on the loop index 𝑛.

void main( ) {
    float b;
 b=0.0;

 float _arrA_0[6];_arrA_0[0]=0.1;_arrA_0[1]=0.1;_arrA_0[2]=0.1;_arrA_0[3]=0.1;_arrA_0[4]=0.1;_arrA_0[5]=0.1;
 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚 A_0 = 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚(0.0,_arrA_0);
 ...
 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚 A_6 = 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚(6.0,_arrA_6);
  float _arrA_7[6];_arrA_7[0]=0.1;_arrA_7[1]=0.1;_arrA_7[2]=0.1;_arrA_7[3]=0.1;_arrA_7[4]=0.1;_arrA_7[5]=0.1;
 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚 A_7 = 𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚(7.0,_arrA_7);

 for (int _iter=0; _iter < 8; _iter++) {
  𝙎𝙩𝙧𝙪𝙘𝙩𝙉𝙖𝙢𝙚 s;
  if (_iter==0) { s=A_0;}
  if (_iter==1) { s=A_1;}
  if (_iter==2) { s=A_2;}
  if (_iter==3) { s=A_3;}
  if (_iter==4) { s=A_4;}
  if (_iter==5) { s=A_5;}
  if (_iter==6) { s=A_6;}
  if (_iter==7) { s=A_7;}
  b += s.num;
  for (int i=0; i < 6; i++) {
      b += s.arr[i];
     }
    }
    out_float = b;
  }
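
The post does not show the generated struct typedef, so the following is only a guess at its general shape: a sketch that infers a GLSL struct declaration from one sample Python dict and emits the if chain that selects the iterator target. The names struct_typedef, struct_dispatch and MyStruct are hypothetical, and every field is assumed to be a float or a list of floats.

```python
def struct_typedef(name, sample):
    """Build a GLSL struct declaration from one sample dict (floats only)."""
    fields = []
    for key, value in sample.items():
        if isinstance(value, list):
            # A list attribute becomes a fixed-size float array member.
            fields.append("  float %s[%d];" % (key, len(value)))
        else:
            fields.append("  float %s;" % key)
    return "struct %s {\n%s\n};" % (name, "\n".join(fields))


def struct_dispatch(name, target, count):
    """Emit the if chain that assigns A_n to the loop target."""
    return ["if (_iter==%d) { %s=%s_%d;}" % (n, target, name, n) for n in range(count)]


sample = {"num": 0, "arr": [0.1 for s in range(6)]}
print(struct_typedef("MyStruct", sample))
print("\n".join(struct_dispatch("A", "s", 8)))
```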

Wednesday, June 18, 2014

PythonJS GPU Mandelbrot


PythonJS now supports translation of a limited subset of Python syntax into GPU code. This is done with a new GLSL backend using the WebCLGL library by Roberto Gonzalez. The above benchmark calculates the Mandelbrot set for a size of 512x512. CPython3 with NumPy takes over eight seconds to complete, while the specialized PythonJS version takes just 0.2 seconds.

source code: mandelbrot.py

40X faster is probably not the best that WebCLGL can do, because this benchmark was performed with a low-end GPU. To run the benchmarks yourself, install Python2 and Python3, extract the latest PyPy to your home directory, and install NumPy for each. Install the NodeWebkit NPM package. Then run these commands, which will save the file /tmp/mandelbrot.py.eps:

cd
git clone https://github.com/3DRoberto/webclgl.git
git clone https://github.com/PythonJS/PythonJS.git
cd PythonJS/regtests
./run ./bench/mandelbrot.py
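
For readers unfamiliar with the workload, the per-pixel escape-time calculation that a Mandelbrot benchmark performs can be sketched in plain Python. This is not the linked mandelbrot.py; the iteration limit, the coordinate bounds and the function names are assumptions chosen for illustration.

```python
def mandelbrot_iterations(cr, ci, max_iter=100):
    """Return how many steps z = z*z + c takes to leave the radius-2 disk."""
    zr = zi = 0.0
    for i in range(max_iter):
        zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
        if zr * zr + zi * zi > 4.0:
            return i
    return max_iter


def mandelbrot_grid(width, height, max_iter=100):
    """Evaluate every pixel independently (the part a GPU runs in parallel)."""
    rows = []
    for py in range(height):
        ci = -1.5 + 3.0 * py / height
        rows.append([
            mandelbrot_iterations(-2.0 + 3.0 * px / width, ci, max_iter)
            for px in range(width)
        ])
    return rows


grid = mandelbrot_grid(8, 4)
print(len(grid), len(grid[0]))
```

Each pixel depends only on its own coordinates, so the inner function maps naturally onto a shader that runs once per output element.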

Friday, June 13, 2014

PythonJS SIMD Vectors part2


The Dart backend has a new class float32vec that encapsulates a list of Float32x4 SIMD vectors. The class acts like a normal vector, allowing you to do element-wise operations and look-ups, while under the hood it indexes and operates on the appropriate sub-vector. This allows you to write SIMD-accelerated code without having to manually break things apart into chunks of four. Looping over the sub-vectors and the encapsulation add some overhead, which slows down performance for arrays with more than 32 elements. The micro benchmark above was performed with 32 elements and shows better performance than CPython with NumPy.

benchmark source code


float32vec

class float32vec:
 def __init__(self, items):
  self[...] = new( List() )
  self.length = items.length

  i = 0; s = 0
  while i < items.length:
   x = items[i]
   y = items[i+1]
   z = items[i+2]
   w = items[i+3]
   vec = new( Float32x4(x,y,z,w) )
   self[...].add( vec )
   i += 4


 def __getitem__(self, index):
  if index < 0: index = self.length + index

  float32x4 vec = self[...][ index // 4 ]
  lane = index % 4
  if lane == 0: return vec.x
  elif lane == 1: return vec.y
  elif lane == 2: return vec.z
  elif lane == 3: return vec.w

 def __setitem__(self, index, value):
  if index < 0: index = self.length + index

  vec = self[...][ index // 4 ]
  lane = index % 4
  if lane == 0: vec = vec.withX(value)
  elif lane == 1: vec = vec.withY(value)
  elif lane == 2: vec = vec.withZ(value)
  elif lane == 3: vec = vec.withW(value)

  self[...][ index // 4 ] = vec

 def __add__(self, other):
  arr = new( List() )
  for i, vec1 in enumerate( self[...] ):
   vec2 = other[...][ i ]
   arr.add( vec1+vec2 )

  v = inline("new float32vec([])")
  v.length = self.length
  v[...] = arr
  return v

 def __mul__(self, other):
  arr = new( List() )
  for i, vec1 in enumerate( self[...] ):
   vec2 = other[...][ i ]
   arr.add( vec1*vec2 )

  v = inline("new float32vec([])")
  v.length = self.length
  v[...] = arr
  return v
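
The index arithmetic used by float32vec (chunk = index // 4, lane = index % 4) can be tried out in plain Python without SIMD. The sketch below is an analogue, not the Dart-backend class: ChunkedVec is a hypothetical name, ordinary lists stand in for Float32x4, and the zero padding of the last chunk is an addition that the original class does not have.

```python
class ChunkedVec:
    """Plain-Python stand-in for a vector stored as chunks of four lanes."""
    LANES = 4

    def __init__(self, items):
        self.length = len(items)
        # Pad to a multiple of four so every chunk is full (an assumption
        # of this sketch; the original reads items[i+1..i+3] directly).
        padded = list(items) + [0.0] * (-len(items) % self.LANES)
        self.chunks = [padded[i:i + self.LANES]
                       for i in range(0, len(padded), self.LANES)]

    def _locate(self, index):
        if index < 0:
            index += self.length
        if not 0 <= index < self.length:
            raise IndexError(index)
        return divmod(index, self.LANES)  # (chunk number, lane number)

    def __getitem__(self, index):
        chunk, lane = self._locate(index)
        return self.chunks[chunk][lane]

    def __setitem__(self, index, value):
        chunk, lane = self._locate(index)
        self.chunks[chunk][lane] = value

    def __mul__(self, other):
        # Multiply chunk by chunk, as the SIMD version does per Float32x4.
        result = ChunkedVec([])
        result.length = self.length
        result.chunks = [[a * b for a, b in zip(c1, c2)]
                         for c1, c2 in zip(self.chunks, other.chunks)]
        return result


v = ChunkedVec([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(v[5], v[-1], len(v.chunks))
```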


PythonJS SIMD Vectors


PythonJS using direct SIMD via the Dart backend and running in the Dart VM is about 6X faster than CPython with NumPy in the following micro benchmark testing float32x4 multiplication. SIMD stands for single instruction, multiple data, and it allows you to instruct the CPU to perform the same math operation on a vector of data to increase performance. Read more about SIMD on my old research blog, here.

I was expecting NumPy to have specialized the case of an array with four float32 elements to use SIMD. Searching around for why it does not, I could not find any clear answers: [1], [2]. More confused and curious, I jumped into the PyPy IRC chat room, and Matti Picus gave me the answer: NumPy has no direct support for SIMD; instead it relies on helper libraries like MKL, BLAS, and LAPACK.

The DartVM includes SIMD with float32x4 and int32x4 primitives as part of the core language; you simply import dart:typed_data. Google Chrome and Firefox are also in the process of supporting SIMD.

SIMD multiply micro benchmark

The PythonJS translator has been updated to pass type information to the Dart backend, and translate numpy.array into a Float32x4 vector if it has been typed as float32x4. See my previous blog post about optional static typing, here.

def main():
 start = time()
 float32x4 a = numpy.array( 
  [1.0001, 1.0002, 1.0003, 1.0004], 
  dtype=numpy.float32 )
 float32x4 b = numpy.array( 
  [1.00009, 1.00008, 1.00007, 1.00006], 
  dtype=numpy.float32 )
 float32x4 c = numpy.array( 
  [1.00005, 1.00004, 1.00003, 1.00002], 
  dtype=numpy.float32 )

 arr = []
 for i in range(20000):
  c *= a*b
  arr.append( a*b*c )

 print(time()-start)

Wednesday, June 11, 2014

PythonJS faster than CPython part2


My earlier post PythonJS now faster than CPython struck a nerve with the Python community, and sparked some heated debate on Hacker News. The fact is that CPython is embarrassingly slow compared to V8, and the adoption rate of Python3 is very slow compared to JavaScript's explosive growth.

As pointed out by a few people in the Hacker News thread, my first benchmarks were performed with PyPy1.9, so I have updated the test runner to use both the latest PyPy 2.3.1 and PyPy1.9. I have also removed the micro-benchmarks, and now only include benchmarks from the Unladen Swallow test suite.

PythonJS (with the fast-backend) is faster than CPython in four of the following five benchmarks. Some may argue that this is not a fair comparison because the fast-backend is only a subset of the Python language standard. Keep in mind that: 1. it is a large and useful subset, and 2. PythonJS is designed to allow you to mix both modes within the same script by blocking out code as fully compliant or fast using special with blocks or calling pythonjs.configure.

nbody

pystone

richards

fannkuch

float

benchmark source code


Python has some big problems besides speed, such as running on mobile devices. Translation to JavaScript, and to other targets like C++ and Rust, makes it possible to sidestep the Python interpreter, and that makes deployment much simpler.

Saturday, June 7, 2014

64bit integer long type


PythonJS now has a new static type long that can be used to mark a variable as a 64bit integer, see this commit. JavaScript has no native support for 64bit integers, so the translator will use the Long.js API to construct a Long object and call the appropriate methods for math and comparison logic.

python input

def main():
 long x = 65536
 long y = x * x
 long z = 4294967296
 TestError( y==z )

 long a = z + z
 long b = 8589934592
 TestError( a==b )

 TestError( y < b )
 TestError( b > y )

 TestError( y <= b )
 TestError( b >= y )

javascript output

main = function() {
  if (__NODEJS__==true) var long = require('long');
  
  x = long.fromString("65536");
  y = x.multiply(x);
  z = long.fromString("4294967296");
  TestError( y.equals(z) );
  a = z.add(z);
  b = long.fromString("8589934592");
  TestError( a.equals(b) );
  TestError( y.lessThan(b) );
  TestError( b.greaterThan(y) );
  TestError( y.lessThanOrEqual(b) );
  TestError( b.greaterThanOrEqual(y) );
}
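
The reason a library is needed can be shown in Python, whose float type is the same IEEE 754 double that JavaScript uses for every number: integers above 2**53 are no longer exact. The sketch below also shows the general idea of holding a 64bit value as two 32bit halves, which is how Long.js represents its values; split64 and join64 are hypothetical helper names and not part of any API.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def split64(value):
    """Split an integer into (high, low) unsigned 32-bit halves."""
    value &= MASK64
    return value >> 32, value & MASK32


def join64(high, low):
    """Rebuild the 64-bit value from its two 32-bit halves."""
    return ((high & MASK32) << 32) | (low & MASK32)


# A double cannot tell these two integers apart.
print(float(2**53) == float(2**53 + 1))

# The values from the test above, held as 32-bit halves.
x = 65536
y = x * x
print(split64(y), join64(*split64(y + y)))
```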

Friday, June 6, 2014

optional static typing


People have been asking for optional types for Python for at least 10 years, and only recently have a couple of good solutions appeared. One is MyPy by Jukka Lehtosalo, but the github project hasn't been updated in several months, and there are no recent updates on the dev blog either. Python could greatly benefit from optional typing: it makes your code clearer and more readable, and it can provide better compile-time checks and better performance.

typed_int.py

The above benchmark shows that PythonJS in normal mode with statically typed variables is 20 times faster than the version below without statically typed variables.

untyped_int.py

Most of the performance gain comes from typing the arr variable as list. This allows the translator to bypass a runtime method lookup and instead directly call arr.push.

new syntax

def f(arr, x):
  list arr
  int x
  int y = somefunction()
  arr.append( x+y )

PythonJS allows you to type each variable in a function body as a plain statement int x or in an assignment expression int z = somefunction(). This is different from the approach of MyPy where things can only be typed in the function's annotations. It is implemented as a simple text pre-processor, for the details see here, and source code here. This is still a very new feature and can only optimize a few cases. In the future it can be leveraged to further increase performance, and provide translation-time error checking.
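
To make the idea of a text pre-processor concrete, here is a minimal sketch that strips type declarations from source lines and records them in a table. It is not the PythonJS implementation (which is linked above); the function name strip_type_declarations and the list of recognised type names are assumptions.

```python
import re

# Type names recognised by this sketch (an assumption, not the real list).
TYPE_NAMES = ("int", "float", "long", "list")
_DECL = re.compile(r"^(\s*)(%s)\s+(\w+)\s*(=\s*.*)?$" % "|".join(TYPE_NAMES))


def strip_type_declarations(source):
    """Return (plain_python_source, {variable: type_name})."""
    types = {}
    out = []
    for line in source.splitlines():
        m = _DECL.match(line)
        if m is None:
            out.append(line)
            continue
        indent, type_name, var, assignment = m.groups()
        types[var] = type_name
        # "int y = f()" keeps its assignment; a bare "int x" is dropped.
        if assignment:
            out.append("%s%s %s" % (indent, var, assignment))
    return "\n".join(out), types


src = """def f(arr, x):
  list arr
  int x
  int y = somefunction()
  arr.append( x+y )"""
plain, types = strip_type_declarations(src)
print(plain)
print(types)
```

A translator can then consult the recorded table when it emits code, for example to call arr.push directly when arr is known to be a list.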

Thursday, June 5, 2014

automatic synchronous to async transform


Asynchronous programming in JavaScript can quickly become a mess of callbacks, sometimes called `callback hell`. This hell gets worse when you need to move some logic to WebWorkers and pass data back and forth. Your once simple synchronous function must be rewritten into many callbacks that shuffle data around using postMessage and onmessage, and trigger the next callback.

PythonJS allows you to code in a synchronous style, and when your code is translated to JavaScript, it will also be transformed into async callbacks. You can call the sleep function to pause a function for a set amount of time while other things take place; the function will resume after the timeout.

These commits: [1], [2], [3], allow the webworker to directly call functions in the main thread, passing data as normal function arguments. Under the hood, PythonJS will call postMessage with the function name and arguments. In the main thread this triggers the requested function, and the result is sent back to the worker. The function in the worker halts until it gets this response from the main thread. What ends up being a lot of async code can be expressed in just a few lines of sync code.

python input

import threading
from time import sleep

shared = []

def blocking_func(x,y):
 shared.append( x )
 shared.append( y )
 return x+y

def async_func( a ):
 shared.append( a )

def main():
 w = threading.start_webworker( worker, [] )
 sleep(1.0)
 assert len(shared)==3
 assert shared[0]==10
 assert shared[1]==20
 assert shared[2]==30
 print('main exit')

## marks this block of code as within the webworker
with webworker:

 def worker():
  ## blocks because the result is assigned to `v`
  v = blocking_func( 10, 20 )
  ## non-blocking because result is not used
  async_func( v )
  self.terminate()

javascript output - main

shared = [];

blocking_func = function(x, y) {
  shared.append(x);
  shared.append(y);
  var __left16, __right17;
  __left16 = x;
  __right17 = y;
  return ((( typeof(__left16) ) == "number") ? (__left16 + __right17) : __add_op(__left16, __right17));
}

async_func = function(a) {
  shared.append(a);
}

main = function() {
  var w;
  w = __start_new_thread(worker, []);
  __run__ = true;
  var __callback0 = function() {
    __run__ = true;
    var __callback1 = function() {
      console.log("main exit");
    }

    if (__run__) {
      setTimeout(__callback1, 1000.0);
    } else {
      if (__continue__) {
        setTimeout(__callback2, 1000.0);
      }
    }
  }

  if (__run__) {
    setTimeout(__callback0, 200.0);
  } else {
    if (__continue__) {
      setTimeout(__callback1, 200.0);
    }
  }
}
worker = "/tmp/worker.js";

javascript output - webworker


onmessage = function(e) {
  if (( e.data.type ) == "execute") {
    worker.apply(self, e.data.args);
    if (! (threading._blocking_callback)) {
      self.postMessage({ "type":"terminate" });
    }
  } else {
    if (( e.data.type ) == "append") {
      __wargs__[e.data.argindex].push(e.data.value);
    } else {
      if (( e.data.type ) == "__setitem__") {
        __wargs__[e.data.argindex][e.data.key] = e.data.value;
      } else {
        if (( e.data.type ) == "return_to_blocking_callback") {
          threading._blocking_callback(e.data.result);
        }
      }
    }
  }
}
self.onmessage = onmessage;

worker = function() {
  var v;
  v = self.postMessage({
    "type":"call",
    "function":"blocking_func",
    "args":[10, 20] 
  });
  var __blocking = function(v) {
    self.postMessage({
       "type":"call",
        "function":"async_func",
        "args":[v] 
    });
    self.postMessage({ "type":"terminate" });
    threading._blocking_callback = null;
  }

  threading._blocking_callback = __blocking;
}
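
The shape of this transform can be simulated in plain Python without real threads. In the sketch below the worker function is split by hand at the blocking call: everything after it becomes a continuation that runs when the main side delivers the result. The class names, the message format and the single-threaded message pump are all assumptions made for illustration; they mirror the generated JavaScript only loosely.

```python
from collections import deque


class MainThread:
    """Stands in for the page: owns `shared` and executes requested calls."""

    def __init__(self):
        self.shared = []
        self.inbox = deque()

    def blocking_func(self, x, y):
        self.shared.append(x)
        self.shared.append(y)
        return x + y

    def async_func(self, a):
        self.shared.append(a)

    def post(self, message):
        self.inbox.append(message)

    def pump(self, worker):
        # Process queued calls; send results back only for blocking calls.
        while self.inbox:
            msg = self.inbox.popleft()
            result = getattr(self, msg["function"])(*msg["args"])
            if msg["blocking"]:
                worker.resume(result)


class Worker:
    """Stands in for the webworker after the sync-to-async transform."""

    def __init__(self, main):
        self.main = main
        self.continuation = None

    def run(self):
        # `v = blocking_func(10, 20)` becomes: post the call, then stop.
        self.main.post({"function": "blocking_func", "args": [10, 20], "blocking": True})
        self.continuation = self._after_blocking_call

    def _after_blocking_call(self, v):
        # The rest of the original function body, now a callback.
        self.main.post({"function": "async_func", "args": [v], "blocking": False})

    def resume(self, result):
        continuation, self.continuation = self.continuation, None
        continuation(result)


main = MainThread()
worker = Worker(main)
worker.run()
main.pump(worker)
print(main.shared)
```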

Wednesday, June 4, 2014

PythonJS IDE


Above: Pypubjs running in NodeWebkit, with project preview in the Android emulator and native HTML window. Pypubjs is a simple example that shows how to write a desktop application using PythonJS and NodeWebkit. Pypubjs is itself written almost entirely in Python and HTML. It integrates with the python-js NodeJS package to dynamically compile itself and the code you write in the editor.

https://github.com/PythonJS/pypubjs