Tutorial
- Launching
- Expressions and Variables
- Types
- Arrays
- Functions
- Control Flow
- Timestamps
- Dataframes
- Scripting
Launching
Launch Empirical from the command line to get the REPL.
$ path/to/empirical
Empirical version 0.6.0
Copyright (C) 2019--2020 Empirical Software Solutions, LLC
>>>
Alternatively, include a file name to run it.
$ path/to/empirical file_to_run.emp
There are some advanced options on the command line to see the internal state of the compiler. Use --help for the full list.
$ path/to/empirical --help
Magic commands
The REPL has some magic commands that may make development easier. For example, we can time an expression:
>>> \t let p = load("prices.csv")
1ms
We can also load an external Empirical file.
>>> \l my_code.emp
To see all available magic commands, just ask for \help.
>>> \help
Expressions and Variables
Expressions are evaluated and the results are returned to the user.
>>> 7 + 31
38
>>> 2 * 3 + 10
16
>>> 0xFF
255
>>> 3.14
3.14
Variables are declared with let (immutable) or var (mutable).
>>> let x = 1
>>> x + 99
100
>>> var y = x + 99
>>> y
100
>>> y = 3
>>> y
3
Types are inferred automatically, but users can denote types explicitly.
>>> let pi: Float64 = 3.1415
Explicit types are required if no initial value is provided.
>>> var user_name: String
>>> user_name = "Charles Babbage"
>>> user_name
"Charles Babbage"
If no initial value and no type are provided, then we have an error.
>>> var user_age
Error: unable to determine type for user_age
Similarly, types must match if both an initial value and an explicit type are provided.
>>> let e: Int64 = 2.71
Error: type of declaration does not match: Int64 vs Float64
Types
All values have a type, resolved at compile time.
>>> let x: Int64 = 37
The type system is static and strict; this prevents common errors.
>>> x + "5"
Error: unable to match overloaded function +
candidate: (Int64, Int64) -> Int64
argument type at position 1 does not match: String vs Int64
candidate: (Float64, Float64) -> Float64
argument type at position 0 does not match: Int64 vs Float64
candidate: (Int64, Float64) -> Float64
argument type at position 1 does not match: String vs Float64
...
<53 others>
A value can be cast to a desired type.
>>> x + Int64("5")
42
If a cast is invalid, the result is nil (integers) or nan (floating point). The missing-data value is propagated by operators.
>>> x + Int64("5b")
nil
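The propagation rule can be illustrated outside Empirical. The following Python sketch is only an analogy and not Empirical's implementation: parse_int and add are hypothetical helpers, and None stands in for nil.

```python
from typing import Optional


def parse_int(text: str) -> Optional[int]:
    """Return the integer value of text, or None (standing in for nil) if the cast is invalid."""
    try:
        return int(text)
    except ValueError:
        return None


def add(a: Optional[int], b: Optional[int]) -> Optional[int]:
    """Add two values; a missing operand makes the whole result missing."""
    if a is None or b is None:
        return None
    return a + b


x = 37
valid = add(x, parse_int("5"))     # both operands present
invalid = add(x, parse_int("5b"))  # the failed cast propagates through the addition
print(valid, invalid)
```

The point of the design is that a bad cast does not abort the computation; the missing value simply flows through to the result, where it can be inspected.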
User-defined types
Users can define their own types.
>>> data Person: name: String, age: Int64 end
>>> var p = Person("Alice", 37)
These are displayed as a table by default.
>>> p
name age
Alice 37
>>> p.name = "Bob"
>>> p
name age
Bob 37
We can define a type cast if desired.
>>> func String(p: Person) = p.name + " is " + String(p.age) + " years old"
>>> String(p)
"Bob is 37 years old"
Prepending a user-defined type with a bang (!) changes the type to a Dataframe. All entries will be vectorized.
>>> !Person(["Alice", "Bob"], [37, 39])
name age
Alice 37
Bob 39
User-defined types can accept templates.
>>> data Person2{AgeType}: name: String, age: AgeType end
>>> Person2{Int64}("A", 1)
name age
A 1
>>> !Person2{Float64}(["A", "B"], [1.1, 1.2])
name age
A 1.1
B 1.2
The above examples are in statement syntax. Types can be defined with expression syntax.
>>> data I = Int64
>>> var i: I
>>> i = 17
>>> data Person3 = {name: String, age: Int64}
Templates and expression syntax can be combined for a type provider. This allows for programmatically determining a type.
>>> data Provider{f: String} = compile(f)
>>> let s = "{name: String, age: Int64}"
>>> var obj: Provider{s}
We can always recall the type of an expression.
>>> type_of(x)
<type: Int64>
>>> type_of(i)
<type: Int64>
>>> type_of(x > 7)
<type: Bool>
>>> var y: type_of(x)
>>> y = 7
>>> type_of(Int64)
<type: Kind(Int64)>
>>> Int64
<type: Int64>
>>> type_of(p)
<type: Person>
>>> Person
<type: Person>
>>> Person2
<template>
>>> Person3
<type: Person3>
Arrays
Arrays can be made from any builtin type.
>>> [1, 2, 3]
[1, 2, 3]
>>> let xs: [Float64] = [1., 2., 3.]
>>> xs
[1.0, 2.0, 3.0]
Applying an operator on a vector with a scalar causes the scalar operator to apply to each vector element.
>>> xs * 3.0
[3.0, 6.0, 9.0]
Applying an operator between two vectors is an element-wise application.
>>> xs * [2., 4., 6.]
[2.0, 8.0, 18.0]
Element-wise operations require that the vectors be the same length.
>>> xs * [2., 4.]
Error: Mismatch array lengths
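The three cases above (vector with scalar, vector with vector, and mismatched lengths) can be sketched in Python. This is an analogy under stated assumptions, not how Empirical implements its operators: multiply is a hypothetical helper, and Python lists stand in for Empirical arrays.

```python
def multiply(left, right):
    """Multiply scalars or lists, applying a scalar to every element of a list."""
    left_is_list = isinstance(left, list)
    right_is_list = isinstance(right, list)
    if left_is_list and right_is_list:
        # Element-wise application requires equal lengths.
        if len(left) != len(right):
            raise ValueError("mismatched array lengths")
        return [a * b for a, b in zip(left, right)]
    if left_is_list:
        return [a * right for a in left]
    if right_is_list:
        return [left * b for b in right]
    return left * right


xs = [1.0, 2.0, 3.0]
scaled = multiply(xs, 3.0)                # scalar applied to each element
pairwise = multiply(xs, [2.0, 4.0, 6.0])  # element-wise product
print(scaled, pairwise)
```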
Arrays of consecutive integers can be created from range().
>>> range(100)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, ...]
The type can be recalled.
>>> type_of(xs)
<type: [Float64]>
>>> [Float64]
<type: [Float64]>
Functions
Functions apply to a list of arguments.
>>> func add(x, y) = x + y
>>> add(3, 7)
10
>>> add("A", "B")
"AB"
The above is an example of expression syntax, where the function is defined as a single expression. Functions can also be defined with statement syntax with an explicit return.
>>> func mult(x, y): return x * y end
>>> mult(3, 7)
21
>>> mult(0.1, 0.9)
0.09
Functions are generic by default, meaning that the argument types are determined from the caller. But types can be listed explicitly.
>>> func add2(x: Int64, y: Int64): return x + y end
Functions can be overloaded by parameter type.
>>> func add2(x: Bool, y: Bool): return x or y end
>>> add2(true, false)
true
>>> add2(1, 0)
1
An error occurs if types don’t match during a function call.
>>> add2(3.4, 5.6)
Error: unable to match overloaded function add2
candidate: (Int64, Int64) -> Int64
argument type at position 0 does not match: Float64 vs Int64
candidate: (Bool, Bool) -> Bool
argument type at position 0 does not match: Float64 vs Bool
Generic functions can be specialized by overloading with a specific type.
>>> func add(a: Char, b: Char) = String(a) + String(b)
>>> add(3, 4)
7
>>> add('a', 'b')
"ab"
Generic functions can have placeholders to provide some degree of strong typing.
>>> func mult2[T](a: [T], b: T) = a * b
>>> mult2([1, 2, 3], 4)
[4, 8, 12]
>>> mult2(1, 4)
Error: argument type at position 0 does not match: Int64 vs [T]
>>> mult2([1, 2, 3], 4.0)
Error: argument type at position 1 does not match: Float64 vs T aka Int64
Operators are just syntactic sugar for a function call.
>>> (+)(3, 5)
8
Operators can be overloaded.
>>> data Point: x: Int64, y: Int64 end
>>> func (+)(p: Point, n: Int64) = Point(p.x + n, p.y + n)
>>> Point(5, 7) + 12
x y
17 19
User-defined literals can be defined by prepending suffix to any function name.
>>> func suffix_w(x: Int64) = x * 3
>>> 7_w
21
>>> 0xFF_w
765
>>> func suffix_z(x: Float64): return 3.0 * x end
>>> 1.2e4_z
36000.0
The return type is also inferred, but may be listed explicitly.
>>> func add3(x: Int64, y: Int64, z: Int64) -> Int64: return add2(add2(x, y), z) end
>>> add3(4, 5, 6)
15
Functions can, of course, be recursive.
>>> func fac(x: Int64): if x == 0: return 1 else: return x * fac(x - 1) end end
>>> fac(5)
120
Metaprogramming
Functions can take templates.
>>> func mult2{T}(x: T, y: T) = x * y
>>> mult2{Int64}(4, 6)
24
>>> mult2{Float64}(4.0, 6.0)
24.0
The template parameter is a Type by default, but value parameters are also permitted.
>>> func inc{i: Int64}(x: Int64) = x + i
>>> inc{1}(7)
8
>>> inc{10}(7)
17
A macro is possible by prepending a dollar sign to a parameter name. As with templates, the caller must provide a comptime literal (a simple value, such as a String or Int64, that can be derived at compile time).
>>> func inc2($ i: Int64, x: Int64) = x + i
>>> inc2(7, 8)
15
Empirical’s compile-time function evaluation (CTFE) will automatically determine the result of an expression ahead of time if possible.
>>> let v = 100 - 88
>>> inc2(v / 3, 21)
25
A mutable variable is not permitted in CTFE because its value can change. (IO-derived values are also prohibited because their results cannot be determined at compile time.)
>>> var u = 100
>>> inc2(u, 50)
Error: macro parameter i requires a comptime literal
A function can be inlined, meaning that the function body is pasted into the caller’s location. This can speed up small expressions.
>>> func triple(x: Int64) => x + x + x
>>> triple(7)
21
A function’s type information is available as well.
>>> add
<generic func>
>>> type_of(add)
<type: (_, _) -> _>
>>> add3
<func: add3>
>>> type_of(add3)
<type: (Int64, Int64, Int64) -> Int64>
>>> (+)
<func>
>>> type_of(sum)
<type: overloaded>
>>> inc
<template>
>>> inc2
<macro>
>>> type_of(inc{1})
<type: (Int64) -> Int64>
Control Flow
Loops and conditionals require a boolean expression.
>>> let x = 7
>>> x < 1
false
Conditional expressions can nest via elif.
>>> func code(c: String): if c == "red": return 'R' elif c == "blue": return 'B' else: return '?' end end
>>> code("green")
'?'
>>> code("blue")
'B'
>>> code("red")
'R'
Loops repeatedly execute until a condition is false.
>>> var y = 0
>>> while y < 10: y = y + 1 end
>>> y
10
Timestamps
Timestamps are stored as nanoseconds-since-epoch and are displayed in human-readable form on the console. (We can get the current timestamp via now().)
>>> let t1 = Timestamp("2019-03-24 05:58:55.663131")
>>> let t2 = Timestamp("2019-03-24 05:59:18.980145")
The difference between two timestamps is a Timedelta.
>>> t2 - t1
Timedelta("00:00:23.317014")
This difference can be added back to any timestamp.
>>> let d1 = t2 - t1
>>> t2 + d1
Timestamp("2019-03-24 05:59:42.297159")
A Timedelta can be represented via a suffix; available units are ns, us, ms, s, m, h, d.
>>> 5ms
Timedelta("00:00:00.005")
>>> 4h
Timedelta("04:00:00")
We can use arithmetic with Timedelta to manipulate a timestamp for aggregation.
>>> bar(t2, 5m)
Timestamp("2019-03-24 05:55:00")
>>> bar(t2, 1d)
Timestamp("2019-03-24 00:00:00")
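One way to read bar() is as flooring a timestamp to the start of its bucket, with buckets counted from the epoch. The Python sketch below rests on that assumption, which agrees with the two outputs above but is not confirmed by the source. The Python function bar and the EPOCH constant are illustrative only, and Python datetimes carry microsecond rather than nanosecond resolution.

```python
from datetime import datetime, timedelta

# Assumed reference point for bucket boundaries.
EPOCH = datetime(1970, 1, 1)


def bar(ts: datetime, width: timedelta) -> datetime:
    """Floor ts to the start of its width-sized bucket, counted from EPOCH."""
    elapsed = ts - EPOCH
    whole_buckets = elapsed // width  # integer number of complete buckets
    return EPOCH + whole_buckets * width


t2 = datetime(2019, 3, 24, 5, 59, 18, 980145)
five_minute_bar = bar(t2, timedelta(minutes=5))
daily_bar = bar(t2, timedelta(days=1))
print(five_minute_bar)
print(daily_bar)
```

Because every timestamp in a bucket maps to the same value, the result can serve as a grouping key when aggregating.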
Date and Time can isolate specific parts of a timestamp.
>>> let date = Date(t2)
>>> let time = Time(t2)
>>> date
Date("2019-03-24")
>>> time
Time("05:59:18.980145")
>>> date + time
Timestamp("2019-03-24 05:59:18.980145")
As with any other operator in Empirical, array actions are native.
>>> date + [1d, 10d]
[Date("2019-03-25"), Date("2019-04-03")]
>>> 1d * Timedelta([1, 2, 3])
[Timedelta("1 days"), Timedelta("2 days"), Timedelta("3 days")]
>>> bar([t1, t2], 1s)
[Timestamp("2019-03-24 05:58:55"), Timestamp("2019-03-24 05:59:18")]
Invalid timestamps are nil.
>>> Timestamp("err")
Timestamp(nil)
Dataframes
All sample CSV files are available to download here or use on repl.it.
Empirical has statically typed Dataframes. The types can be inferred by load() if the parameter is a comptime literal.
>>> let trades = load("trades.csv"), quotes = load("quotes.csv"), events = load("events.csv")
>>> trades
symbol timestamp price size
AAPL 2019-05-01 09:30:00.578802 210.5200 780
AAPL 2019-05-01 09:30:00.580485 210.8100 390
BAC 2019-05-01 09:30:00.629205 30.2500 510
CVX 2019-05-01 09:30:00.944122 117.8000 5860
AAPL 2019-05-01 09:30:01.002405 211.1300 320
AAPL 2019-05-01 09:30:01.066917 211.1186 310
AAPL 2019-05-01 09:30:01.118968 211.0000 730
BAC 2019-05-01 09:30:01.186416 30.2450 380
CVX 2019-05-01 09:30:01.639577 118.2550 2880
BAC 2019-05-01 09:30:01.867638 30.2450 260
AAPL 2019-05-01 09:30:02.065535 211.1800 260
BAC 2019-05-01 09:30:02.118224 30.2600 300
CVX 2019-05-01 09:30:02.260710 118.3100 1450
BAC 2019-05-01 09:30:02.379882 30.2650 300
AAPL 2019-05-01 09:30:02.422211 211.3300 270
CVX 2019-05-01 09:30:02.439735 118.2900 760
CVX 2019-05-01 09:30:02.869668 118.2700 980
BAC 2019-05-01 09:30:02.987527 30.2350 220
AAPL 2019-05-01 09:30:03.057945 211.4425 300
CVX 2019-05-01 09:30:03.363338 118.5100 990
... ... ... ...
>>> columns(trades)
symbol: String
timestamp: Timestamp
price: Float64
size: Int64
>>> len(trades)
5817
The Dataframe is printed to the console in such a way as to fill the REPL window’s size. Just resize the window to change how much is printed. Also, the Dataframe can be reversed to see the end of it.
>>> reverse(trades)
symbol timestamp price size
BAC 2019-05-01 09:46:21.531340 29.9650 770
BAC 2019-05-01 09:46:20.866846 29.9150 280
BAC 2019-05-01 09:46:20.049704 29.9200 320
BAC 2019-05-01 09:46:19.154440 29.9550 130
BAC 2019-05-01 09:46:18.950952 29.9800 360
BAC 2019-05-01 09:46:18.139979 29.9850 260
BAC 2019-05-01 09:46:17.242817 29.9900 470
BAC 2019-05-01 09:46:17.044261 29.9950 240
BAC 2019-05-01 09:46:16.862898 30.0113 130
BAC 2019-05-01 09:46:16.642879 29.9950 240
BAC 2019-05-01 09:46:15.793486 29.9750 50
BAC 2019-05-01 09:46:14.845249 29.9950 230
BAC 2019-05-01 09:46:13.866494 29.9700 180
BAC 2019-05-01 09:46:13.385324 29.9200 300
BAC 2019-05-01 09:46:13.375352 29.9600 230
BAC 2019-05-01 09:46:13.049421 29.9800 100
BAC 2019-05-01 09:46:12.419462 29.9900 50
BAC 2019-05-01 09:46:12.209153 29.9900 110
BAC 2019-05-01 09:46:11.981420 29.9900 130
BAC 2019-05-01 09:46:11.604512 30.0500 370
... ... ... ...
Queries are built in.
>>> from trades select where symbol == "AAPL" and size > 1000
symbol timestamp price size
AAPL 2019-05-01 09:37:45.647850 205.0600 1010
AAPL 2019-05-01 09:38:24.754932 204.9200 2010
AAPL 2019-05-01 09:42:57.450065 203.7332 1130
We can also perform aggregations.
>>> from trades select volume = sum(size) by symbol
symbol volume
AAPL 135760
BAC 223590
CVX 507580
Aggregations can take arbitrary expressions, including user-defined functions. Here is an example of a volume-weighted average price (VWAP):
>>> func wavg(ws, vs) = sum(ws * vs) / sum(ws)
>>> from trades select vwap = wavg(size, price) by symbol, bar(timestamp, 5m)
symbol timestamp vwap
AAPL 2019-05-01 09:30:00 210.305724
BAC 2019-05-01 09:30:00 30.483875
CVX 2019-05-01 09:30:00 119.427733
AAPL 2019-05-01 09:35:00 202.972440
BAC 2019-05-01 09:35:00 30.848397
CVX 2019-05-01 09:35:00 119.431601
AAPL 2019-05-01 09:40:00 204.671388
BAC 2019-05-01 09:40:00 30.217362
CVX 2019-05-01 09:40:00 117.224763
AAPL 2019-05-01 09:45:00 206.494583
BAC 2019-05-01 09:45:00 30.070924
CVX 2019-05-01 09:45:00 118.073644
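The arithmetic behind a grouped VWAP can be checked by hand with a small Python sketch. This is an analogy, not Empirical's query engine: the rows below are hypothetical and are not taken from trades.csv, and the grouping is done with a plain dictionary.

```python
from collections import defaultdict


def wavg(weights, values):
    """Weighted average: sum(w * v) / sum(w)."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical (symbol, price, size) rows for illustration only.
trades = [
    ("AAPL", 210.0, 100),
    ("AAPL", 212.0, 300),
    ("BAC", 30.0, 200),
    ("BAC", 31.0, 200),
]

# Collect the sizes and prices of each symbol, mirroring "by symbol".
groups = defaultdict(lambda: ([], []))
for symbol, price, size in trades:
    sizes, prices = groups[symbol]
    sizes.append(size)
    prices.append(price)

vwap = {symbol: wavg(sizes, prices) for symbol, (sizes, prices) in groups.items()}
print(vwap)
```

For the hypothetical AAPL rows, the larger trade at 212.0 pulls the average above the midpoint of the two prices, which is the behavior a volume weighting is meant to produce.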
We can sort a Dataframe by a column or an expression. This example sorts by the bid-ask spread:
>>> sort quotes by (ask - bid) / bid
symbol timestamp bid ask
BAC 2019-05-01 09:32:46.313487 30.5650 30.5650
BAC 2019-05-01 09:32:53.738446 30.6124 30.6124
BAC 2019-05-01 09:39:24.459415 31.0600 31.0600
AAPL 2019-05-01 09:45:51.931597 206.9400 206.9500
AAPL 2019-05-01 09:43:59.903292 206.3200 206.3300
BAC 2019-05-01 09:32:50.369746 30.6400 30.6417
CVX 2019-05-01 09:32:57.242072 119.7732 119.7800
AAPL 2019-05-01 09:38:18.980026 205.1100 205.1222
AAPL 2019-05-01 09:38:19.978890 205.1100 205.1251
CVX 2019-05-01 09:37:59.439853 117.5700 117.5800
CVX 2019-05-01 09:37:15.725633 117.5500 117.5600
CVX 2019-05-01 09:44:08.411541 117.5500 117.5600
CVX 2019-05-01 09:37:13.676526 117.5000 117.5100
AAPL 2019-05-01 09:31:52.241969 214.0800 214.1000
CVX 2019-05-01 09:37:46.188810 117.8189 117.8300
AAPL 2019-05-01 09:44:02.188362 206.1700 206.1900
AAPL 2019-05-01 09:44:05.553974 205.8600 205.8800
AAPL 2019-05-01 09:37:25.351114 204.9800 205.0000
AAPL 2019-05-01 09:36:54.176575 204.9100 204.9300
AAPL 2019-05-01 09:41:08.041997 203.9600 203.9800
... ... ... ...
Dataframes can join on one or more columns. They can also join as of a column: for every row in the left table, get the last row in the right table whose timestamp is less than or equal to the timestamp in the left.
>>> join trades, quotes on symbol asof timestamp
symbol timestamp price size bid ask
AAPL 2019-05-01 09:30:00.578802 210.5200 780 210.8000 211.15
AAPL 2019-05-01 09:30:00.580485 210.8100 390 210.8000 211.15
BAC 2019-05-01 09:30:00.629205 30.2500 510 30.2400 30.27
CVX 2019-05-01 09:30:00.944122 117.8000 5860 117.7600 118.34
AAPL 2019-05-01 09:30:01.002405 211.1300 320 210.8000 211.15
AAPL 2019-05-01 09:30:01.066917 211.1186 310 210.8000 211.15
AAPL 2019-05-01 09:30:01.118968 211.0000 730 210.8000 211.15
BAC 2019-05-01 09:30:01.186416 30.2450 380 30.2400 30.27
CVX 2019-05-01 09:30:01.639577 118.2550 2880 118.2600 118.37
BAC 2019-05-01 09:30:01.867638 30.2450 260 30.2300 30.26
AAPL 2019-05-01 09:30:02.065535 211.1800 260 211.1500 211.40
BAC 2019-05-01 09:30:02.118224 30.2600 300 30.2300 30.26
CVX 2019-05-01 09:30:02.260710 118.3100 1450 118.2600 118.54
BAC 2019-05-01 09:30:02.379882 30.2650 300 30.2300 30.26
AAPL 2019-05-01 09:30:02.422211 211.3300 270 211.2433 211.61
CVX 2019-05-01 09:30:02.439735 118.2900 760 118.2600 118.54
CVX 2019-05-01 09:30:02.869668 118.2700 980 118.4800 118.58
BAC 2019-05-01 09:30:02.987527 30.2350 220 30.2300 30.26
AAPL 2019-05-01 09:30:03.057945 211.4425 300 211.2433 211.61
CVX 2019-05-01 09:30:03.363338 118.5100 990 118.4100 118.48
... ... ... ... ... ...
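The asof rule described above (for each left row, take the last right row with the same key whose timestamp is less than or equal to the left timestamp) can be sketched in Python with a binary search. This is a minimal illustration under stated assumptions, not Empirical's implementation: asof_join is a hypothetical helper, timestamps are plain integers, and the rows are made up.

```python
from bisect import bisect_right


def asof_join(left, right):
    """Attach to each (key, time, value) left row the payload of the latest
    (key, time, payload) right row with the same key and time <= left time."""
    # Index the right rows by key, with times in ascending order.
    by_key = {}
    for key, time, payload in sorted(right, key=lambda row: (row[0], row[1])):
        times, payloads = by_key.setdefault(key, ([], []))
        times.append(time)
        payloads.append(payload)

    joined = []
    for key, time, value in left:
        times, payloads = by_key.get(key, ([], []))
        pos = bisect_right(times, time)  # count of right rows with time <= left time
        match = payloads[pos - 1] if pos > 0 else None
        joined.append((key, time, value, match))
    return joined


# Hypothetical rows for illustration only.
left_rows = [("AAPL", 10, 210.52), ("BAC", 12, 30.25), ("AAPL", 20, 211.13)]
right_rows = [("AAPL", 5, "q1"), ("AAPL", 15, "q2"), ("BAC", 13, "q3")]
result = asof_join(left_rows, right_rows)
print(result)
```

In this sketch the BAC row has no match because its only candidate on the right is later than the left timestamp.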
Asof joins can take parameters, such as changing direction or bounding the search.
>>> join trades, events on symbol asof timestamp nearest within 3s
symbol timestamp price size code
AAPL 2019-05-01 09:30:00.578802 210.5200 780
AAPL 2019-05-01 09:30:00.580485 210.8100 390
BAC 2019-05-01 09:30:00.629205 30.2500 510
CVX 2019-05-01 09:30:00.944122 117.8000 5860 a1
AAPL 2019-05-01 09:30:01.002405 211.1300 320
AAPL 2019-05-01 09:30:01.066917 211.1186 310
AAPL 2019-05-01 09:30:01.118968 211.0000 730
BAC 2019-05-01 09:30:01.186416 30.2450 380 e3
CVX 2019-05-01 09:30:01.639577 118.2550 2880 a1
BAC 2019-05-01 09:30:01.867638 30.2450 260 e3
AAPL 2019-05-01 09:30:02.065535 211.1800 260
BAC 2019-05-01 09:30:02.118224 30.2600 300 e3
CVX 2019-05-01 09:30:02.260710 118.3100 1450 a1
BAC 2019-05-01 09:30:02.379882 30.2650 300 e3
AAPL 2019-05-01 09:30:02.422211 211.3300 270
CVX 2019-05-01 09:30:02.439735 118.2900 760 a1
CVX 2019-05-01 09:30:02.869668 118.2700 980 a1
BAC 2019-05-01 09:30:02.987527 30.2350 220 e3
AAPL 2019-05-01 09:30:03.057945 211.4425 300 f7
CVX 2019-05-01 09:30:03.363338 118.5100 990 a1
... ... ... ... ...
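A plausible reading of "nearest within 3s" is: pick the closest right-hand timestamp in either direction, and leave the row unmatched if nothing lies inside the bound. The Python sketch below encodes that reading as an assumption. nearest_within is a hypothetical helper, the bound is treated as inclusive, and ties go to the earlier candidate; none of these details are confirmed by the source.

```python
def nearest_within(times, target, tolerance):
    """Return the entry of times closest to target, or None if no entry
    lies within tolerance of target (bound assumed inclusive)."""
    candidates = [t for t in times if abs(t - target) <= tolerance]
    if not candidates:
        return None
    # min() keeps the first of equally close candidates.
    return min(candidates, key=lambda t: abs(t - target))


# Hypothetical timestamps expressed as integer seconds.
match_before = nearest_within([1, 8, 20], 10, 3)  # 8 is two seconds before the target
match_after = nearest_within([2, 12], 10, 3)      # 12 is two seconds after the target
no_match = nearest_within([1, 20], 10, 3)         # nothing within three seconds
print(match_before, match_after, no_match)
```

An unmatched row corresponds to the empty code cells in the output above.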
Scripting
Empirical scripts get command-line arguments as argv. Since the argument values are not known at compile time, and because Empirical is a statically typed language, users cannot call load(). Instead, users must supply an explicit type to the templated function csv_load{}(). (Fortunately, the type definition can be seen in the REPL ahead-of-time with columns().) Results can be saved with store().
$ cat aggregate.emp
#!path/to/empirical
# single price row
data Price:
  symbol: String,
  date: Date,
  open: Float64,
  high: Float64,
  low: Float64,
  close: Float64,
  volume: Int64
end
# argv[0] is the script name
if len(argv) != 2:
  print("Missing path to CSV file")
  exit(1)
end
# aggregate volumes from the price file
let prices = csv_load{Price}(argv[1])
let v = from prices select sum(volume) by symbol
store(v, "volumes.csv")
If execute privileges are given to the script (chmod a+x), then simply run the script on the command line.
$ ./aggregate.emp
Missing path to CSV file
$ ./aggregate.emp prices.csv
$ cat volumes.csv
symbol,volume
AAPL,277096071
BRK.B,33905036
EBAY,95312664