Hannah – a DSL for parsing and generating files and network traces
Computer science knows about formal grammars for specifying the words of a formal language. For a given grammar, one is usually interested in two things:
- Generating words by repeatedly applying the rules of the grammar.
- Parsing words and determining if they are included in the language defined by the grammar.
In this article, we adapt the central idea of formal grammars as a single specification for both generating and parsing words. But instead of using a formal grammar, we introduce a prototypical domain specific language (DSL) called Hannah, which serves as a specification language for files and network traces. In the following, we give an example of a specification of (a subset) of the BMP image file format:
expect-const uint8 0x42; # B
expect-const uint8 0x4d; # M
expect-value "size" uint32;
expect-value "reserved" uint32;
expect-value "image-data-offset" uint32;
expect-const uint32 40; # size of info header
expect-value "width" uint32;
expect-value "height" uint32;
expect-const uint16 1; # number of color planes
expect-const uint16 24; # bits per pixel
expect-const uint32 0; # no compression
expect-value "image-data-size" uint32; # image data size
expect-value "horizontal-pixel-per-meter" int32;
expect-value "vertical-pixel-per-meter" int32;
expect-const uint32 0; # no color palette
expect-const uint32 0; # no important colors
sequence "row" of-length "height"
{
sequence "pixel" of-length "width"
{
expect-value "color" uint8 of-length 3;
}
expect-const uint16 0; # padding
}
Hannah’s specification language defines the memory layout of a file using a small set of primitive statements. For example, the initial expect-const uint8 0x42;
statement in the specification above defines that a BMP file starts with the eight bit unsigned integer value 0x42
. If this assumption does not hold for a given file, it is not considered a BMP file.
In contrast to the expect-const
statement, the expect-value "width" uint32;
statement does not expect a constant value at the current memory location. Instead it binds whatever 32-bit width value is present at the current memory location to the identifier "width"
. The identifier "width"
can then be used for defining other parts of the file format, e.g., the number of color values to read in each row.
Interpreting File Specifications
For certain important classes of formal grammars, one can construct algorithms for both generating new words and to check if a given word is included in the language defined by the grammar. In the following, we illustrate a tool that implements both algorithms for file specifications.
The hannah
tool is a prototypical interpreter for Hannah specification files. Note that hannah
has no built-in knowledge about any file format whatsoever – information related to file formats comes solely from the corresponding specification files.
The hannah
tool provides three different modes. The read
mode checks if a given file complies with any of the stored file specifications. For the first matching specification, hannah
prints all parsed values. For example:
$ hannah read files/image.bmp
bmp:
size: 70
reserved: 0
image-data-offset: 54
width: 2
height: 2
image-data-size: 16
horizontal-pixel-per-meter: 2835
vertical-pixel-per-meter: 2835
row:
pixel:
color: 0 0 255
pixel:
color: 255 0 255
row:
pixel:
color: 255 0 0
pixel:
color: 0 255 0
The write
mode is an interactive mode for generating files according to a choosen specification. The following listing shows an exemplary session where the same BMP image file is generated as was read in the previous example:
$ hannah write out.bmp
specification 0: bmp
specification 1: pcap
Choose specification: 0
Enter size (0-4294967295): 70
Enter reserved (0-4294967295): 0
Enter image-data-offset (0-4294967295): 54
Enter width (0-4294967295): 2
Enter height (0-4294967295): 2
Enter image-data-size (0-4294967295): 16
Enter horizontal-pixel-per-meter (-2147483648-2147483647): 2835
Enter vertical-pixel-per-meter (-2147483648-2147483647): 2835
Enter color[0] (0-255): 0
Enter color[1] (0-255): 0
Enter color[2] (0-255): 255
Enter color[0] (0-255): 255
Enter color[1] (0-255): 0
Enter color[2] (0-255): 255
Enter color[0] (0-255): 255
Enter color[1] (0-255): 0
Enter color[2] (0-255): 0
Enter color[0] (0-255): 0
Enter color[1] (0-255): 255
Enter color[2] (0-255): 0
The modify
mode combines the read
and write
modes by firstly parsing a given file, and then providing the read values as defaults for generating a new file. This is especially useful for rapidly altering only certain values in a given file. For example, in the following session, we change the red, green, and blue components of the last pixel to 100, 150, and 200 respectively.
$ hannah modify files/image.bmp out.bmp
specification 0: bmp
specification 1: pcap
Choose specification: 0
Enter size (0-4294967295) [70]:
Enter reserved (0-4294967295) [0]:
Enter image-data-offset (0-4294967295) [54]:
Enter width (0-4294967295) [2]:
Enter height (0-4294967295) [2]:
Enter image-data-size (0-4294967295) [16]:
Enter horizontal-pixel-per-meter (-2147483648-2147483647) [2835]:
Enter vertical-pixel-per-meter (-2147483648-2147483647) [2835]:
Enter color[0] (0-255) [0]:
Enter color[1] (0-255) [0]:
Enter color[2] (0-255) [255]:
Enter color[0] (0-255) [255]:
Enter color[1] (0-255) [0]:
Enter color[2] (0-255) [255]:
Enter color[0] (0-255) [255]:
Enter color[1] (0-255) [0]:
Enter color[2] (0-255) [0]:
Enter color[0] (0-255) [0]: 100
Enter color[1] (0-255) [255]: 150
Enter color[2] (0-255) [0]: 200
Note how most values were left blank: these value are being copied from the input file files/image.bmp
.
Parsing and Generating Network Traces
The PCAP file format is a commonly used format for saving captured network traces. It is a rather simple format that consists of a small header section:
expect-enum "magic" uint32 hex
[ 0xa1b2c3d4 # same byte order - microsecond resolution
, 0xa1b23d4d # same byte order - nanosecond resolution
, 0xd4c3b2a1 # different byte order - microsecond resolution
, 0x4d3cb2a1 # different byte order - nanosecond resolution
];
byte-order-system ("magic" == 0xa1b2c3d4 || "magic" == 0xa1b23d4d);
expect-value "major-version" uint16;
expect-value "minor-version" uint16;
expect-value "this-timezone" uint32;
expect-const uint32 0;
expect-value "snapshot-length" uint32;
expect-value "data-link-type" uint32
[ 0 -> "Null"
, 1 -> "Ethernet"
, 101 -> "Raw IP"
];
sequence "packet"
{
byte-order-system ("magic" == 0xa1b2c3d4 || "magic" == 0xa1b23d4d);
expect-value "timestamp-seconds" uint32;
expect-value "timestamp-microseconds" uint32;
expect-value "captured-size" uint32;
expect-value "actual-size" uint32;
byte-order-big-endian;
if ("data-link-type" == 1)
{
try ["pcap/ethernet", "pcap/data"];
}
else
{
try ["pcap/data"];
}
}
The header section is followed by a sequence of network frames as they were recorded from a network interface. To handle these frames with the hannah
tool, we need to specify the underlying network protocols. As most network protocols have a layered structure, we specify each layer in a separate specification file. For example, a layer 2 Ethernet frame has the following format:
expect-value "mac-destination" uint8 of-length 6 hex;
expect-value "mac-source" uint8 of-length 6 hex;
expect-enum "ethertype" uint16 hex [ 0x0800 -> "IPv4" ];
if ("ethertype" == 0x0800)
{
try ["pcap/ethernet/ipv4"];
}
On the third layer, we specify the structure of the IPv4 header:
expect-value [4 -> "version", 4 -> "internet-header-length"] uint8;
expect-value [6 -> "DSCP", 2 -> "ECN"] uint8 hex;
expect-value "total-length" uint16;
expect-value "identification" uint16 hex;
expect-value [ 1 -> "reserved"
, 1 -> "do-not-fragment"
, 1 -> "more-fragments"
, 13 -> "fragment-offset"
] uint16 hex;
expect-value "time-to-live" uint8;
expect-enum "protocol" uint8 [ 0x06 -> "TCP" ];
expect-value "header-checksum" uint16 hex;
expect-value "ip-source" uint8 of-length 4;
expect-value "ip-destination" uint8 of-length 4;
if ("internet-header-length" > 5)
{
let "options-length" = ("internet-header-length" - 5) * 4;
expect-data "options" of-length "options-length";
}
if ("protocol" == 0x06)
{
try ["pcap/ethernet/ipv4/tcp"];
}
On the forth layer, we specify the structure of the TCP header:
expect-value "port-destination" uint16;
expect-value "port-source" uint16;
expect-value "sequence-number" uint32;
expect-value "acknowledgement-number" uint32;
expect-value [ 4 -> "data-offset"
, 3 -> "reserved"
, 1 -> "nonce-sum"
, 1 -> "congestion-window-reduced"
, 1 -> "explicit-congestion-notification"
, 1 -> "urgent"
, 1 -> "acknowledgement"
, 1 -> "push"
, 1 -> "reset"
, 1 -> "synchronize"
, 1 -> "fin"
] uint16;
expect-value "window-size" uint16;
expect-value "checksum" uint16;
expect-value "urgent-pointer" uint16;
if ("data-offset" > 5)
{
let "options-length" = ("data-offset" - 5) * 4;
expect-data "options" of-length "options-length";
}
let "payload-length" = "total-length" - (4 * "internet-header-length")
- (4 * "data-offset");
if ("payload-length" > 0)
{
try ["pcap/ethernet/ipv4/tcp/ascii"];
}
The fifth network layer usually contains application specific protocols. We do not specify their structure any further, but expect just a stream of printable ASCII characters instead:
expect-ascii "ascii" of-length "payload-length";
This is obviously works only for ASCII-based application protocols like HTTP.
With such a specification of network frames, we can use the hannah
tool to parse, create, and modify captured network traces that contain HTTP traffic. The validity of the generated traces can be checked using a traffic analyzer like Wireshark.
Finally, we want to give a sample output of a parsed HTTP request:
pcap:
magic: 0xa1b2c3d4
major-version: 2
minor-version: 4
this-timezone: 0
snapshot-length: 262144
data-link-type: Ethernet
packet:
timestamp-seconds: 1510989676
timestamp-microseconds: 47407
captured-size: 420
actual-size: 420
pcap/ethernet:
mac-destination: 0xfe 0x6d 0x8f 0x20 0x61 0x53
mac-source: 0x00 0x00 0x01 0x00 0x00 0x00
ethertype: IPv4
pcap/ethernet/ipv4:
version: 4
internet-header-length: 5
DSCP: 0x00
ECN: 0x00
total-length: 406
identification: 0x4d35
reserved: 0x00
do-not-fragment: 0x01
more-fragments: 0x00
fragment-offset: 0x0000
time-to-live: 64
protocol: TCP
header-checksum: 0x217f
ip-source: 192 168 0 111
ip-destination: 151 101 114 49
pcap/ethernet/ipv4/tcp:
port-destination: 37560
port-source: 80
sequence-number: 2888627534
acknowledgement-number: 3300797811
data-offset: 8
reserved: 0
nonce-sum: 0
congestion-window-reduced: 0
explicit-congestion-notification: 0
urgent: 0
acknowledgement: 1
push: 1
reset: 0
synchronize: 0
fin: 0
window-size: 1444
checksum: 43826
urgent-pointer: 0
options: 01 01 08 0a 03 7d ef 61 21 b6 f9 7a .....}.a !..z
pcap/ethernet/ipv4/tcp/ascii:
ascii: GET / HTTP/1.1
Host: www.rocketleague.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Cache-Control: max-age=0
Implementation Details
The hannah
tool is written in the functional programming language Haskell. Haskell is well-suited for building interpreters and compilers because of its rich type system which supports algebraic data types, polymorphism, and much more.
When invoking the hannah
tool, all available specifications are initially parsed. We use the Parsec library that provides several parser combinators that simplify implementing parsers for domain specific languages. The result of the parsing phase is an abstract syntax tree (AST) of all parsed specifications.
Depending on the mode (read
, write
, modify
) with which hannah
has been invoked, the AST is interpreted by one or both available interpreters:
The
Read
interpreter attempts to parse a given file using the available specifications in a backtracking manner. This interpreter keeps track of all parsed values and prints them once a matching specification has been found.The
Write
interpreter writes a file according to the statements given in the specification. Whenever it encounters a branching point, the user is prompted to enter a value in order to decide which branch to take.When using the
modify
mode, both interpreters are combined. Firstly, theRead
interpreter attempts to parse the given file, but instead of printing its results, it passes all parsed values to theWrite
interpreter which provides them as defaults at the branching points. This makes it rather easy for the user to modify existing files.
Future Work
While the prototypical implementation of hannah
is pretty much complete from the point of view of the author, interesting challenges lie ahead. For example, the automatic generation of file parsers for other programming languages seems very promising. Therefore, one would need to implement an interpreter which, instead of parsing or writing a file, prints a program that implements the same behavior as the Read
interpreter does, but in a different language, e.g., C or JavaScript. This frees the developer from manually implementing parsers in languages for which such an implementation would be cumbersome and error prone to write. In order to support a new file format, one would just need to write a specification and run the hannah
tool with the new interpreter.
Update: Hannah-JS is an experimental client-side parser written in JavaScript which has been generated by hannah
.
Specification Language Reference
Types
Name | Description | Range |
---|---|---|
uint8 |
8-bit width unsigned integer | 0 to 255 |
uint16 |
16-bit width unsigned integer | 0 to 65535 |
uint32 |
32-bit width unsigned integer | 0 to 4294967295 |
int8 |
8-bit width signed integer | -128 to 127 |
int16 |
16-bit width signed integer | -32768 to 32767 |
int32 |
32-bit width signed integer | -2147483648 to 2147483647 |
Formats
Typed values are printed as decimals by default. If a hexadecimal is more appropriate for a certain value, use hex
as FORMAT
specifier in the corresponding statement.
Assignments
Values can be assigned to user-defined names using an ASSIGNMENT
specifier in the corresponding statements. For example:
[ 0x01 -> "ICMP"
, 0x06 -> "TCP"
, 0x11 -> "UDP"
, 0x29 -> "IPv6" ]
Expressions
The most primitive expressions are constants and names. Both can be combined using
- the binary operators
+
,-
,*
,/
,&&
,||
,==
,!=
,<
,<=
,>
,>=
, and - the unary operators:
+
,-
.
All operators have their usual semantics. Boolean interpretation of integer values happens implicitly as follows:
== 0
→False
!= 0
→True
The file-position
keyword evaluates to the number of bytes read from the current file.
Statements
expect-const TYPE VALUE;
Expects a constant value of a particular type.
expect-value NAME TYPE FORMAT ASSIGNMENT;
Expects a named value of a particular type. The
FORMAT
andASSIGNMENT
specifiers are optional.expect-value NAME TYPE of-length LENGTH FORMAT ASSIGNMENT;
Expects multiple named values of a particular type. The
LENGTH
is either a constant or a bound name. TheFORMAT
andASSIGNMENT
specifiers are optional.expect-value ASSIGNMENT TYPE FORMAT;
Expects multiple named values which are extracted from a single packed value of a particular type. Each rule
X -> Y
ofASSIGNMENT
assigns the value ofX
continuous bits to the identifierY
. For example, the assignment[ 1 -> "reserved" , 1 -> "do-not-fragment" , 1 -> "more-fragments" , 13 -> "fragment-offset" ]
assigns the most significant bit to the name
"reserved"
, the next bit to the name"do-not-fragment"
, the next bit to the name"more-fragments"
, and the final 13 bits to the name"fragment-offset"
. For non-exhaustive assignments, any remaining bits are not bound to any identifier. TheFORMAT
specifier applies to all values and is optional.expect-enum [VALUE_1, VALUE_2, ...] NAME TYPE FORMAT ASSIGNMENT;
Expects one of the enumerated values of a particular type. The
FORMAT
andASSIGNMENT
specifiers are optional. If anASSIGNMENT
is given, the initial enumeration can be omitted and is inferred from theASSIGNMENT
.expect-data NAME of-length LENGTH;
Expects a named sequence of unspecified data. The
LENGTH
is either a constant or a bound name.expect-ascii NAME of-length LENGTH;
Expects a named sequence of unspecified ASCII data. The
LENGTH
is either a constant or a bound name.sequence NAME { STATEMENT_1; STATEMENT_2; ... }
Evaluates a named sequence of statements until
- the whole file has been read (in
read
mode), or - the user explicitly terminates the loop (in
write
andmodify
mode).
- the whole file has been read (in
sequence NAME of-length LENGTH { STATEMENT_1; STATEMENT_2; ... }
Evaluates a named sequence of statements
LENGTH
times withLENGTH
being a variable or constant.sequence NAME { STATEMENT_1; STATEMENT_2; ... } while (EXPRESSION);
Evaluates a named sequence of statements as long as the post-condition
EXPRESSION
holds.if (EXPRESSION) { STATEMENT_1; ... } else { STATEMENT_2; ... }
Evaluates
STATEMENT_1
ff. ifEXPRESSION
evaluates toTrue
, otherwiseSTATEMENT_2
ff. is evaluated.byte-order-little-endian;
Expects all of the following values in little endian byte order.
byte-order-big-endian;
Expects all of the following values in big endian byte order.
byte-order-system (EXPRESSION);
If
EXPRESSION
evaluates toTrue
, all of the following values are expected in the system’s byte order; otherwise in contrary byte order.let NAME = EXPRESSION;
Evaluates
EXPRESSION
and binds the resulting value toNAME
.try [ SPECIFICATION_1, SPECIFICATION_2, ...];
Sequentially attempts to interpret the given specifications. Fails if no matching specification is found.