Inferring static types from an external source
Empirical is a language for time-series analysis. Uniquely, it has statically typed Dataframes whose types can be inferred from an external source at compile time.
>>> let trades = load("trades.csv")
>>> trades
symbol timestamp price size
AAPL 2019-05-01 09:30:00.578802 210.5200 780
AAPL 2019-05-01 09:30:00.580485 210.8100 390
BAC 2019-05-01 09:30:00.629205 30.2500 510
CVX 2019-05-01 09:30:00.944122 117.8000 5860
AAPL 2019-05-01 09:30:01.002405 211.1300 320
AAPL 2019-05-01 09:30:01.066917 211.1186 310
AAPL 2019-05-01 09:30:01.118968 211.0000 730
BAC 2019-05-01 09:30:01.186416 30.2450 380
CVX 2019-05-01 09:30:01.639577 118.2550 2880
BAC 2019-05-01 09:30:01.867638 30.2450 260
AAPL 2019-05-01 09:30:02.065535 211.1800 260
BAC 2019-05-01 09:30:02.118224 30.2600 300
CVX 2019-05-01 09:30:02.260710 118.3100 1450
BAC 2019-05-01 09:30:02.379882 30.2650 300
AAPL 2019-05-01 09:30:02.422211 211.3300 270
CVX 2019-05-01 09:30:02.439735 118.2900 760
CVX 2019-05-01 09:30:02.869668 118.2700 980
BAC 2019-05-01 09:30:02.987527 30.2350 220
AAPL 2019-05-01 09:30:03.057945 211.4425 300
CVX 2019-05-01 09:30:03.363338 118.5100 990
... ... ... ...
>>> columns(trades)
symbol: String
timestamp: Timestamp
price: Float64
size: Int64
Empirical uses a set of metaprogramming techniques to infer the types. Here is the actual definition of the load()
function, written in Empirical:
data CsvProvider{filename: String} = compile(_csv_infer(filename))
func csv_load{T}(filename: String) -> !T => _csv_load(filename, !T)
func load($ filename: String) => csv_load{CsvProvider{filename}}(filename)
And here is the store()
function:
func store(df, filename: String) => _csv_store(type_of(df), df, filename)
This post will cover what the various symbols above mean, and we will see how Empirical’s own features make type inferencing possible.
Virtual machine
As with many languages, Empirical runs on top of a virtual machine, known as Vector Virtual Machine (VVM). Empirical will show assembly code of what it is doing with --dump-vvm
. A simple assignment like:
>>> var x = 5
becomes an instruction to VVM:
assign 5 i64s @1
This assigns the number 5
, which is a 64-bit integer scalar (i64s
), to the global register @1
.
An expression like:
>>> x * 17
becomes:
mul_i64s_i64s @1 17 %0
This multiplies the x
register and the number 17
, both of which are of type i64s
. The result is stored in a temporary/local register %0
, whose contents will be displayed on the console back to the user.
As the name implies, VVM is vector-aware. Most operations support vector actions. For example:
>>> x * [13, 17, 23]
becomes:
alloc i64v %2
append 13 i64s %2
append 17 i64s %2
append 23 i64s %2
mul_i64s_i64v @1 %2 %3
The code allocates a 64-bit integer vector (i64v
) in local register %2
. It then appends the scalar values one-by-one. Finally, it multiplies the original x
(of type i64s
) and the temporary vector (of type i64v
), storing the result in %3
.
Empirical has two instances of VVM: one for runtime scheduling and one for compile-time (“comptime”) scheduling.
Compile-time function evaluation
Consider this simple assignment:
>>> var y = 43 + 27
This is compiled to:
assign 70 i64s @2
Notice how the 43 + 27
became 70
? Empirical decided that y
’s value could be determined entirely at compile time. (We’ll get into how it decides that a little later.) So Empirical ran the expression on the comptime VVM and scheduled only the result on the runtime VVM.
This feature is commonly known as compile-time function evaluation (CTFE), and Empirical is pretty aggressive about using it. CTFE occurs automatically and happens over any expression whose result (1) can be determined at comptime, and (2) produces a literal value. Empirical employs CTFE over array indices, user-define types and functions, and any other kind of expression.
compile()
Empirical needs CTFE for the related compile()
function, which injects a user’s String
value into the compiler. For example:
>>> var z = compile("x + y")
becomes:
add_i64s_i64s @1 @2 @3
If we had instead used the declaration:
>>> var z = compile("5 + 3")
we would have gotten:
assign 8 i64s @3
The compile()
function requires that the string be known at comptime. So this is fine:
>>> compile("print(x)")
but this is not:
>>> var my_string = "print(x)"
>>> compile(my_string)
We get the following error:
Error: compile() requires a comptime string
Mutability
The var
declaration indicates that a variable is mutable, meaning that it can change during runtime. Because of that, the actual value of a var
is not useable for CTFE; only the register that it maps to can be used in the generated instructions for VVM.
Instead, we can tell Empirical that a value will be constant:
>>> let a = 42
This generates no instructions for the runtime VVM! Instead, the Empirical compiler simply holds onto the value until it is needed later. This allows for some interesting tricks:
>>> let f = "print"
>>> let v = "x"
>>> compile(f + "(" + v + ")")
The resulting instruction is now as we would expect:
print_i64s @1
Function traits and computation modes
This brings us back to the question of how Empirical decides that a value can be determined at comptime. The answer is that Empirical tracks how every expression operates, meaning that the compiler knows, for example, when an expression has a side effect (eg., IO or global variables) and when an expression is pure (purely functional, from a computer science standpoint).
The mechanism is known as function traits, which complement the type system. A trait is a positive quality about a function, such as the function is commutative over its arguments, or that a function returns a vector the same size as its input. A function can have multiple traits.
The trait we are concerned with in this post is pure
.
A related concept is computation mode. Ordinarily we think of user code running as we see it (Normal
mode), but we can imagine re-ordering the sequence of steps, like in lazy evaluation.
The mode we are concerned with is Comptime
.
If a type tells us the kind of data, then the traits and mode tell us the kind of computation. And the rules are pretty straightforward:
- Most of VVM’s operators have a
pure
trait, with the obvious exceptions such as IO, clock access, and random-number generation. - All literal values (eg.,
5
,"cat"
,true
) are bothComptime
andpure
. - Variables declared with
let
inherit traits & mode from the assigned expression;var
is alwaysNormal
and has no traits. - A function is
pure
iff all expressions in the function arepure
. - If a
pure
function is applied to allpure
parameters, then the result ispure
. - If a
pure
function is applied to allComptime
parameters, then the result isComptime
.
So as long as a variable is declared with let
and there are no impure operations (IO, etc.), the value will be Comptime
.
>>> traits_of(x + 7)
none
>>> mode_of(x + 7)
Normal
>>> traits_of(a + 7)
pure, transform, linear
>>> mode_of(a + 7)
Comptime
Templates
Empirical has user-defined types and functions, like every language.
data Person:
name: String,
age: Int64
end
func who_is(p: Person):
return p.name + " is " + String(p.age) + " years old"
end
The above is in statement syntax; Empirical also allows expression syntax.
data Person = {name: String, age: Int64}
func who_is(p: Person) = p.name + " is " + String(p.age) + " years old"
Both types of syntax allow templates. Empirical uses squiggly braces ({}
) instead of angle brackets (<>
) because they are much easier to parse.
>>> data Person{AgeType}: name: String, age: AgeType end
>>> Person{Int64}("A", 1)
name age
A 1
>>> Person{Float64}("A", 1.1)
name age
A 1.1
>>> func who_is{T}(p: T) = p.name + " is " + String(p.age) + " years old"
>>> let p = Person{Int64}("A", 1)
>>> who_is{type_of(p)}(p)
"A is 1 years old"
Type providers
A type provider is a way of programmatically defining a type. Empirical does this by combining templates and compile()
. First, consider VVM’s _csv_infer()
function:
>>> _csv_infer("trades.csv")
"{symbol: String, timestamp: Timestamp, price: Float64, size: Int64}"
This samples the first 10 lines of a CSV file and returns a String
of the inferred type. VVM actually cheats here since _csv_infer()
is listed as pure
, but we can get away with it since the parameter must be Comptime
for the type provider to work anyway.
Now we are ready to revisit our first line of Empirical code from the top:
data CsvProvider{filename: String} = compile(_csv_infer(filename))
This takes a value template (as opposed to a type template). Templates must be given a comptime literal, so CTFE is invoked automatically on the parameter:
>>> CsvProvider{"trades" + ".csv"}
<type: CsvProvider{"trades.csv"}>
The resulting type is inferred at compile time!
Dataframes
It’s worth taking a brief segue here to discuss how Empirical handles Dataframes. As with most analytics platforms, each column in Empirical is just a vector. A simple type can be converted to a Dataframe type by prepending it with an exclamation point (!
).
>>> let p = !Person{Int64}(["A", "B", "C"], [1, 2, 3])
>>> p
name age
A 1
B 2
C 3
>>> p.age
[1, 2, 3]
Now we can begin to understand the second statement from the top. Here is a similar line:
func csv_load{T}(filename: String) = _csv_load(filename, !T)
VVM’s _csv_load()
needs to know the type to use when reading a file. We simply supply it with the Dataframe type from the template (!T
).
Also, we must note that while Empirical can usually infer the return type from a function definition, that doesn’t work here. VVM’s _csv_load()
can return any type, so Empirical must list it explicitly with the single arrow (->
):
func csv_load{T}(filename: String) -> !T = _csv_load(filename, !T)
We can now call this function to get our Dataframe.
>>> csv_load{CsvProvider{"trades.csv"}}("trades.csv")
Macros
Listing "trades.csv"
twice is kind of annoying. Fortunately, we can use a macro to shorten it simply by prepending a dollar sign ($
) to any parameter. Empirical’s macros are syntactic sugar for templates; the function:
foo($ a: Int64, b: Int64)
is really just:
foo{a: Int64}(b: Int64)
The macro parameter, as with any template value parameter, must be a comptime literal.
Our third line of Empirical from the top is similar to:
func load($ filename: String) = csv_load{CsvProvider{filename}}(filename)
This is much more convenient.
Inline functions
Let’s consider how VVM handles user-defined functions. This example:
>>> func inc(x: Int64) = x + 1
>>> inc(8)
becomes:
@4 = def inc("x": i64s) i64s:
add_i64s_i64s %0 1 %1
ret %1
end
call @4 2 8 %11
The inc()
function takes an i64s
(called x
) and returns an i64s
. This function is assigned to global register @4
. When we call it, we must pass two (2
) parameters: the input (8
) and the location of the output (local/temporary register %11
).
This works pretty well, but it’s a lot of overhead for such a simple routine. Fortunately, Empirical has inline functions, which are forced via the double arrow (=>
):
>>> func inc(x: Int64) => x + 1
>>> inc(8)
The entire result is:
add_i64s_i64s 8 1 %11
And with that, we now have our full two lines from the top:
func csv_load{T}(filename: String) -> !T => _csv_load(filename, !T)
func load($ filename: String) => csv_load{CsvProvider{filename}}(filename)
Generics
Templates are great, but they require supplying the type explicitly.
>>> func add{T}(a: T, b: T) => a + b
As with macros, we have some syntactic sugar in the form of generics. By omitting the parameters type, we tell Empirical to derive the type based on the caller:
>>> func add(a, b) => a + b
>>> add(4, 5)
9
>>> add("Hello ", "World")
"Hello World"
Not all parameters need to be generic or specified; users can mix and match as desired. That leads us to our final line from the top:
func store(df, filename: String) => _csv_store(type_of(df), df, filename)
This invokes VVM’s _csv_store()
with the type of the Dataframe.
Putting it all together
By combining a virtual machine, CTFE, traits & modes, templates, macros, generics, and inlining, we can define type-specializing code in Empirical itself.
From a fresh Empirical instance:
>>> let trades = load("trades.csv")
>>> store(trades, "trades2.csv")
becomes:
$0 = {"symbol": Sv, "timestamp": Tv, "price": f64v, "size": i64v}
@2 = "trades.csv"
@3 = "trades2.csv"
load @2 $0 @1
store $0 @1 @3
The trades
variable is mapped to @1
, the inferred type is listed in type definition $0
, and the strings are internalized as @2
and @3
.