A tour of the Vector Virtual Machine
Empirical’s Vector Virtual Machine (VVM) is the core component that makes both a responsive REPL and automated compile-time function evaluation (CTFE) possible. All Empirical code is compiled to VVM’s register-based bytecode. Launching Empirical with --dump-vvm will print the disassembled bytecode. For example:
>>> 13 + 17
becomes:
add_i64s_i64s 13 17 %0
The %0 is a temporary/local register that stores the result. If we had assigned the value to a variable, it would likely resolve to @1, a global register. (@0 is argv, the list of user arguments on the command line for a script.)
Instructions
As with any virtual machine, VVM operates on instructions. Each instruction starts with an opcode, like add_i64s_i64s, and continues with a pre-determined number of operands. Addition takes two inputs and one output, so add_i64s_i64s will always take three operands.
An operand is almost always a register, but certain instructions can take a boolean or a small integer directly. Any other literal is stored in the constant pool, a set of global registers:
@1 = "Hello World"
@2 = 1.25
@3 = 4611686018427387904
The opcode is often specialized for its input types. The add_i64s_i64s indicates that addition takes two 64-bit integer scalars. As of this writing, there are a total of 52 addition instructions, including:
add_f64v_i64s - floating-point vector and integer scalar
add_Sv_Sv - two string vectors
add_Ts_Dv - timestamp scalar and timedelta vector
add_DAs_TIs - date scalar and time scalar
Concatenating two strings:
>>> "Hello " + "World"
results in storing the constants before invoking the add instruction:
@4 = "Hello "
@5 = "World"
add_Ss_Ss @4 @5 %10
Type codes
Opcodes can be specialized ahead of time for builtin types, but user-defined types require a different approach. For that reason, some instructions take a type code.
>>> var x: Int64
becomes:
alloc i64s @6
The alloc instruction must be able to take any type. For example:
>>> data Person: name: String, age: Int64 end
>>> var me: Person
becomes:
$0 = {"name": Ss, "age": i64s}
alloc $0 @7
The $0 is a user-defined type code for VVM. The type’s definition (layout and member names) is mapped to a code that can be passed as an operand. Passing the type as a parameter means that alloc can handle anything.
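As a rough sketch (these are not VVM’s actual data structures), interning a type definition so its index can travel as a single operand might look like:
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: store each user-defined type's layout in a
// table so that its index ($0, $1, ...) can be passed as an operand.
struct TypeDef {
  std::vector<std::pair<std::string, std::string>> members;  // name, type code
};

std::vector<TypeDef> user_types_;  // index n is the $n code

size_t intern_type(TypeDef def) {
  user_types_.push_back(std::move(def));
  return user_types_.size() - 1;  // $0 for the first user-defined type
}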
Dispatch
Each opcode maps to a C++ function in the runtime. Since many opcodes perform similar actions with different inputs, the corresponding C++ routine is often a templated function. Our add_i64s_i64s invokes:
add_ss<int64_t, int64_t, int64_t>(...);
This C++ function adds the values in the first two operands and stores the result in the third operand. We will investigate the code for this later, but for now we will see how the dispatch takes place.
The bytecode is the set of instructions, and each instruction is an opcode followed by a pre-determined number of operands. By making all opcodes and operands integers of the same size, we can define the whole set of instructions as a single integer array. And indeed that’s what happens in C++:
typedef std::vector<size_t> instructions_t;
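To make that layout concrete, here is a minimal sketch (using a hypothetical opcodes enum; how immediates are distinguished from registers is a VVM detail not shown here) of our earlier addition flattened into such an array:
enum class opcodes : size_t { halt, alloc, add_i64s_i64s /* ... */ };

// A sketch, not VVM's actual encoding: the instruction
//   add_i64s_i64s 13 17 %0
// flattened into consecutive integers, followed by halt.
instructions_t code = {
    static_cast<size_t>(opcodes::add_i64s_i64s),  // opcode
    13,                                           // left operand (immediate)
    17,                                           // right operand (immediate)
    0,                                            // result register %0
    static_cast<size_t>(opcodes::halt),           // stop the interpreter
};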
To perform the dispatch, we simply jump to the logic that sets up the function call for each opcode:
void dispatch(const instructions_t& code) {
  size_t p;
  ip_ = 0;
  while (true) {
    switch (opcodes(code[ip_])) {
      case opcodes::halt:
        return;
      case opcodes::alloc:
        p = ip_;
        ip_ += 3;  // opcode + 2 operands
        alloc(code[p + 1], code[p + 2]);
        break;
      ...
      case opcodes::add_i64s_i64s:
        p = ip_;
        ip_ += 4;  // opcode + 3 operands
        add_ss<int64_t, int64_t, int64_t>(code[p + 1], code[p + 2], code[p + 3]);
        break;
      ...
    }
  }
}
The ip_ above is the instruction pointer, also known as a program counter (PC). It is a global variable that stores the current index into our bytecode array. (The reason it’s a global variable is so that branching instructions and function calls can manipulate our location in the bytecode.) By jumping to the appropriate function-call logic for each opcode, we can invoke the C++ function with the correct number of arguments.
The halt opcode is a hidden instruction that appears at the end of every function definition and bytecode array. It is generated by the Empirical compiler to tell our interpreter to stop, which means that we don’t need any boundary checking for ip_.
While the above code is pretty fast, there is an optimization in C++ for compilers that support computed gotos. We can use direct threading (which has nothing to do with multithreading) to jump to the function-call logic without needing the branches that are required by the break and the while.
void dispatch(const instructions_t& code) {
  static void* opcode_labels[] = { &&halt, &&alloc, ..., &&add_i64s_i64s, ... };
  size_t p;
  ip_ = 0;
  goto *opcode_labels[code[ip_]];  // jump to the first instruction
halt:
  return;
alloc:
  p = ip_;
  ip_ += 3;  // opcode + 2 operands
  alloc(code[p + 1], code[p + 2]);
  goto *opcode_labels[code[ip_]];  // jump straight to the next instruction
  ...
add_i64s_i64s:
  p = ip_;
  ip_ += 4;  // opcode + 3 operands
  add_ss<int64_t, int64_t, int64_t>(code[p + 1], code[p + 2], code[p + 3]);
  goto *opcode_labels[code[ip_]];
  ...
}
Now each block of function-call logic is responsible for dispatching the next instruction in the bytecode. (This also avoids the boundary checking that a switch performs.) Computed gotos are available in Clang and GCC, but not in Visual Studio.
Registers
We can see from the dispatch code that the operands are passed to the C++ function. Let’s take a look at one possible implementation of scalar addition:
template<class T, class U, class V>
void add_ss(operand_t left, operand_t right, operand_t result) {
  T& x = get_reference<T>(left);    // resolve each register...
  U& y = get_reference<U>(right);
  V& z = get_reference<V>(result);  // ...to a typed reference
  z = x + y;
}
The register code is simply an index into an array. %1 is really:
local_registers_[1];
And @5 is:
global_registers_[5];
Each individual register is a pointer (void*) whose interpretation is handled by the invoked function. The add_ss() above converts these pointers to C++ references for easier handling. The actual addition is a single C++ add, but there is overhead from indexing the register arrays and dereferencing the pointers.
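A get_reference() consistent with this description might look like the following sketch, where is_global() and index() are hypothetical stand-ins for however VVM actually decodes an operand:
// Hypothetical get_reference(): pick the register file named by the
// operand and cast the stored void* to the requested type.
template<class T>
T& get_reference(operand_t op) {
  void* ptr = is_global(op) ? global_registers_[index(op)]  // @n
                            : local_registers_[index(op)];  // %n
  return *static_cast<T*>(ptr);
}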
VVM’s add_i64s_i64s is allowed an embedded integer for the first two values. As we saw earlier:
add_i64s_i64s 13 17 %0
Since these immediate values are not registers, the C++ function must know how to get the values directly. The actual C++ function is more like:
template<class T, class U, class V>
void add_ss(operand_t left, operand_t right, operand_t result) {
  T x = get_value<T>(left);   // copy: may be an immediate
  U y = get_value<U>(right);
  V& z = get_reference<V>(result);
  z = x + y;
}
The function first checks whether the operand is an embedded value. If the operand is instead a register, then it dereferences the pointer for the underlying value. add_ss() is able to do this because it knows it is getting a scalar value, so the copying is fine. (By contrast, the vectorized add_vv() knows it cannot have an immediate value since the operand must be a vector.)
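Under those assumptions, get_value() might be sketched like this, with is_immediate() and immediate_bits() standing in for whatever tagging scheme VVM’s operands use:
// Hypothetical get_value(): return an embedded literal directly,
// otherwise copy the scalar out of the register.
template<class T>
T get_value(operand_t op) {
  if (is_immediate(op)) {
    return static_cast<T>(immediate_bits(op));  // embedded literal
  }
  return get_reference<T>(op);  // dereference the register
}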
Nil
At this point, it is worth looking at how Empirical handles missing values, which can come from a blank entry in a CSV file, a mismatched join, or a bad cast.
>>> Int64("75")
75
>>> Int64("7b")
nil
VVM represents missing numbers with a sentinel value. Floating points are simply the IEEE nan, while integers are the numeric limit (a size-appropriate INT_MAX). Determining whether a value is missing is just a comparison. Here is a simplified version:
bool is_nil(int64_t x) {
  return x == std::numeric_limits<int64_t>::max();
}
bool is_nil(double x) {
  return std::isnan(x);
}
The function to sum a vector discards missing values; it is similar to:
template<class T, class U>
void sum_v(operand_t left, operand_t result) {
  std::vector<T>& xs = get_reference<std::vector<T>>(left);
  U& y = get_reference<U>(result);
  y = 0;
  for (auto x: xs) {
    if (!is_nil(x)) {  // skip missing entries
      y = y + x;
    }
  }
}
Going back to our add_ss() function, we want to ensure that a missing value propagates. So the actual addition is more like:
z = (is_nil(x) || is_nil(y)) ? nil_value<V>() : x + y;
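The nil_value() helper isn’t defined in the text, but mirroring the is_nil() overloads above, plausible specializations would be:
#include <cstdint>
#include <limits>

// Plausible nil_value() specializations, mirroring is_nil(); the
// actual VVM definitions may differ.
template<class T> T nil_value();

template<> int64_t nil_value<int64_t>() {
  return std::numeric_limits<int64_t>::max();  // integer sentinel
}
template<> double nil_value<double>() {
  return std::numeric_limits<double>::quiet_NaN();  // IEEE nan
}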
That check seems quite slow in a loop. Fortunately there is an optimization: IEEE nan is propagated in hardware! We must check for integers in software, but floating-point numbers can be added directly in hardware. For that we have a special function:
constexpr bool is_int_nil(int64_t x) {
  return x == std::numeric_limits<int64_t>::max();
}
constexpr bool is_int_nil(double x) {
  return false;  // let the hardware propagate nan
}
And then we change the addition line to:
z = (is_int_nil(x) || is_int_nil(y)) ? nil_value<V>() : x + y;
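Putting the pieces together, the scalar addition would then read roughly like this (a sketch assembled from the earlier snippets, not VVM’s verbatim source):
template<class T, class U, class V>
void add_ss(operand_t left, operand_t right, operand_t result) {
  T x = get_value<T>(left);
  U y = get_value<U>(right);
  V& z = get_reference<V>(result);
  // For doubles, is_int_nil() is constexpr false, so the guard
  // vanishes and the hardware add propagates nan on its own.
  z = (is_int_nil(x) || is_int_nil(y)) ? nil_value<V>() : x + y;
}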
With an optimizing C++ compiler, the is_int_nil() guard will disappear for floating points, leaving only the addition. The nan propagation is done entirely in hardware. (I first saw this trick in Python datatable, though I don’t know its origins.)
Repr
Let’s take one more look at our humble addition. If we enable --dump-vvm for our original example, the full disassembled bytecode is:
add_i64s_i64s 13 17 %0
repr %0 i64s %1
save %1
These last two lines are how Empirical and VVM display a result to the console during interactive mode (i.e., in the REPL). The repr instruction converts a value (%0) of a given type (i64s) into a string (%1). The save instruction then copies this string to a designated variable in VVM; it is this saved string that is returned when evaluating a user expression.
The repr for a builtin type is pretty straightforward: convert to a string and then surround it with whatever is needed to make that string a valid Empirical expression. For example, a timestamp is:
Timestamp("2020-09-03 22:43:05.894561")
The repr for a Dataframe is a lot more involved. It determines the size of the user’s console, which is needed to cut off the overflowing rows and columns. It generates each column as a vector of strings, and then transposes them so that each row can be printed line by line. And it must trim excess zeros in unison from floating points and timestamps, which is why those columns line up their decimal points.
Here are the first few lines of the trades table if we dramatically shrink the console window:
>>> trades
symbol                  timestamp  price ...
  AAPL 2019-05-01 09:30:00.578802 210.52 ...
  AAPL 2019-05-01 09:30:00.580485 210.81 ...
   BAC 2019-05-01 09:30:00.629205  30.25 ...
   CVX 2019-05-01 09:30:00.944122 117.80 ...
   ...                        ...    ... ...
Expanding the window a little bit gives us:
>>> trades
symbol                  timestamp    price size
  AAPL 2019-05-01 09:30:00.578802 210.5200  780
  AAPL 2019-05-01 09:30:00.580485 210.8100  390
   BAC 2019-05-01 09:30:00.629205  30.2500  510
   CVX 2019-05-01 09:30:00.944122 117.8000 5860
  AAPL 2019-05-01 09:30:01.002405 211.1300  320
  AAPL 2019-05-01 09:30:01.066917 211.1186  310
  AAPL 2019-05-01 09:30:01.118968 211.0000  730
   ...                        ...      ...  ...
Our repr does this resizing automatically.