Hannah – a DSL for parsing and generating files and network traces

Computer science knows about formal grammars for specifying the words of a formal language. For a given grammar, one is usually interested in two things:

In this article, we adapt the central idea of formal grammars as a single specification for both generating and parsing words. But instead of using a formal grammar, we introduce a prototypical domain specific language (DSL) called Hannah, which serves as a specification language for files and network traces. In the following, we give an example of a specification of (a subset) of the BMP image file format:

expect-const uint8 0x42;                          # B
expect-const uint8 0x4d;                          # M

expect-value "size" uint32;
expect-value "reserved" uint32;
expect-value "image-data-offset" uint32;

expect-const uint32 40;                           # size of info header
expect-value "width" uint32;
expect-value "height" uint32;
expect-const uint16 1;                            # number of color planes
expect-const uint16 24;                           # bits per pixel
expect-const uint32 0;                            # no compression
expect-value "image-data-size" uint32;            # image data size
expect-value "horizontal-pixel-per-meter" int32;
expect-value "vertical-pixel-per-meter" int32;
expect-const uint32 0;                            # no color palette
expect-const uint32 0;                            # no important colors

sequence "row" of-length "height"
{
  sequence "pixel" of-length "width"
  {
    expect-value "color" uint8 of-length 3;
  }
  expect-const uint16 0;                          # padding
}

Hannah’s specification language defines the memory layout of a file using a small set of primitive statements. For example, the initial expect-const uint8 0x42; statement in the specification above defines that a BMP file starts with the eight bit unsigned integer value 0x42. If this assumption does not hold for a given file, it is not considered a BMP file.

In contrast to the expect-const statement, the expect-value "width" uint32; statement does not expect a constant value at the current memory location. Instead it binds whatever 32-bit width value is present at the current memory location to the identifier "width". The identifier "width" can then be used for defining other parts of the file format, e.g., the number of color values to read in each row.

Interpreting File Specifications

For certain important classes of formal grammars, one can construct algorithms for both generating new words and to check if a given word is included in the language defined by the grammar. In the following, we illustrate a tool that implements both algorithms for file specifications.

The hannah tool is a prototypical interpreter for Hannah specification files. Note that hannah has no built-in knowledge about any file format whatsoever – information related to file formats comes solely from the corresponding specification files.

The hannah tool provides three different modes. The read mode checks if a given file complies with any of the stored file specifications. For the first matching specification, hannah prints all parsed values. For example:

$ hannah read files/image.bmp
bmp: 
    size: 70
    reserved: 0
    image-data-offset: 54
    width: 2
    height: 2
    image-data-size: 16
    horizontal-pixel-per-meter: 2835
    vertical-pixel-per-meter: 2835
    row: 
        pixel: 
            color: 0 0 255
        pixel: 
            color: 255 0 255
    row: 
        pixel: 
            color: 255 0 0
        pixel: 
            color: 0 255 0

The write mode is an interactive mode for generating files according to a choosen specification. The following listing shows an exemplary session where the same BMP image file is generated as was read in the previous example:

$ hannah write out.bmp
specification 0: bmp
specification 1: pcap
Choose specification: 0
Enter size (0-4294967295): 70
Enter reserved (0-4294967295): 0
Enter image-data-offset (0-4294967295): 54
Enter width (0-4294967295): 2
Enter height (0-4294967295): 2
Enter image-data-size (0-4294967295): 16
Enter horizontal-pixel-per-meter (-2147483648-2147483647): 2835
Enter vertical-pixel-per-meter (-2147483648-2147483647): 2835
Enter color[0] (0-255): 0
Enter color[1] (0-255): 0
Enter color[2] (0-255): 255
Enter color[0] (0-255): 255
Enter color[1] (0-255): 0
Enter color[2] (0-255): 255
Enter color[0] (0-255): 255
Enter color[1] (0-255): 0
Enter color[2] (0-255): 0
Enter color[0] (0-255): 0
Enter color[1] (0-255): 255
Enter color[2] (0-255): 0

The modify mode combines the read and write modes by firstly parsing a given file, and then providing the read values as defaults for generating a new file. This is especially useful for rapidly altering only certain values in a given file. For example, in the following session, we change the red, green, and blue components of the last pixel to 100, 150, and 200 respectively.

$ hannah modify files/image.bmp out.bmp
specification 0: bmp
specification 1: pcap
Choose specification: 0
Enter size (0-4294967295) [70]: 
Enter reserved (0-4294967295) [0]: 
Enter image-data-offset (0-4294967295) [54]: 
Enter width (0-4294967295) [2]: 
Enter height (0-4294967295) [2]: 
Enter image-data-size (0-4294967295) [16]: 
Enter horizontal-pixel-per-meter (-2147483648-2147483647) [2835]: 
Enter vertical-pixel-per-meter (-2147483648-2147483647) [2835]: 
Enter color[0] (0-255) [0]: 
Enter color[1] (0-255) [0]: 
Enter color[2] (0-255) [255]: 
Enter color[0] (0-255) [255]: 
Enter color[1] (0-255) [0]: 
Enter color[2] (0-255) [255]: 
Enter color[0] (0-255) [255]: 
Enter color[1] (0-255) [0]: 
Enter color[2] (0-255) [0]: 
Enter color[0] (0-255) [0]: 100
Enter color[1] (0-255) [255]: 150
Enter color[2] (0-255) [0]: 200

Note how most values were left blank: these value are being copied from the input file files/image.bmp.

Parsing and Generating Network Traces

The PCAP file format is a commonly used format for saving captured network traces. It is a rather simple format that consists of a small header section:

expect-enum "magic" uint32 hex
  [ 0xa1b2c3d4  # same byte order - microsecond resolution
  , 0xa1b23d4d  # same byte order - nanosecond resolution
  , 0xd4c3b2a1  # different byte order - microsecond resolution
  , 0x4d3cb2a1  # different byte order - nanosecond resolution
  ];
byte-order-system ("magic" == 0xa1b2c3d4 || "magic" == 0xa1b23d4d);

expect-value "major-version" uint16;
expect-value "minor-version" uint16;
expect-value "this-timezone" uint32;
expect-const uint32 0;
expect-value "snapshot-length" uint32;
expect-value "data-link-type" uint32
  [ 0   -> "Null"
  , 1   -> "Ethernet"
  , 101 -> "Raw IP"
  ];
sequence "packet"
{
  byte-order-system ("magic" == 0xa1b2c3d4 || "magic" == 0xa1b23d4d);

  expect-value "timestamp-seconds" uint32;
  expect-value "timestamp-microseconds" uint32;
  expect-value "captured-size" uint32;
  expect-value "actual-size" uint32;

  byte-order-big-endian;

  if ("data-link-type" == 1)
  {
    try ["pcap/ethernet", "pcap/data"];
  }
  else
  {
    try ["pcap/data"];
  }
}

The header section is followed by a sequence of network frames as they were recorded from a network interface. To handle these frames with the hannah tool, we need to specify the underlying network protocols. As most network protocols have a layered structure, we specify each layer in a separate specification file. For example, a layer 2 Ethernet frame has the following format:

expect-value "mac-destination" uint8 of-length 6 hex;
expect-value "mac-source" uint8 of-length 6 hex;
expect-enum "ethertype" uint16 hex [ 0x0800 -> "IPv4" ];

if ("ethertype" == 0x0800)
{
  try ["pcap/ethernet/ipv4"];
}

On the third layer, we specify the structure of the IPv4 header:

expect-value [4 -> "version", 4 -> "internet-header-length"] uint8;
expect-value [6 -> "DSCP", 2 -> "ECN"] uint8 hex;
expect-value "total-length" uint16;
expect-value "identification" uint16 hex;
expect-value [ 1  -> "reserved"
             , 1  -> "do-not-fragment"
             , 1  -> "more-fragments"
             , 13 -> "fragment-offset"
             ] uint16 hex;
expect-value "time-to-live" uint8;
expect-enum "protocol" uint8 [ 0x06 -> "TCP" ];
expect-value "header-checksum" uint16 hex;
expect-value "ip-source" uint8 of-length 4;
expect-value "ip-destination" uint8 of-length 4;

if ("internet-header-length" > 5)
{
  let "options-length" = ("internet-header-length" - 5) * 4;
  expect-data "options" of-length "options-length";
}

if ("protocol" == 0x06)
{
  try ["pcap/ethernet/ipv4/tcp"];
}

On the forth layer, we specify the structure of the TCP header:

expect-value "port-destination" uint16;
expect-value "port-source" uint16;
expect-value "sequence-number" uint32;
expect-value "acknowledgement-number" uint32;
expect-value [ 4 -> "data-offset"
             , 3 -> "reserved"
             , 1 -> "nonce-sum"
             , 1 -> "congestion-window-reduced"
             , 1 -> "explicit-congestion-notification"
             , 1 -> "urgent"
             , 1 -> "acknowledgement"
             , 1 -> "push"
             , 1 -> "reset"
             , 1 -> "synchronize"
             , 1 -> "fin"
             ] uint16;
expect-value "window-size" uint16;
expect-value "checksum" uint16;
expect-value "urgent-pointer" uint16;

if ("data-offset" > 5)
{
  let "options-length" = ("data-offset" - 5) * 4;
  expect-data "options" of-length "options-length";
}

let "payload-length" = "total-length" - (4 * "internet-header-length")
                                      - (4 * "data-offset");
if ("payload-length" > 0)
{
  try ["pcap/ethernet/ipv4/tcp/ascii"];
}

The fifth network layer usually contains application specific protocols. We do not specify their structure any further, but expect just a stream of printable ASCII characters instead:

expect-ascii "ascii" of-length "payload-length";

This is obviously works only for ASCII-based application protocols like HTTP.

With such a specification of network frames, we can use the hannah tool to parse, create, and modify captured network traces that contain HTTP traffic. The validity of the generated traces can be checked using a traffic analyzer like Wireshark.

Finally, we want to give a sample output of a parsed HTTP request:

pcap: 
magic: 0xa1b2c3d4
major-version: 2
minor-version: 4
this-timezone: 0
snapshot-length: 262144
data-link-type: Ethernet
packet: 
  timestamp-seconds: 1510989676
  timestamp-microseconds: 47407
  captured-size: 420
  actual-size: 420
  pcap/ethernet: 
    mac-destination: 0xfe 0x6d 0x8f 0x20 0x61 0x53
    mac-source: 0x00 0x00 0x01 0x00 0x00 0x00
    ethertype: IPv4
    pcap/ethernet/ipv4: 
      version: 4
      internet-header-length: 5
      DSCP: 0x00
      ECN: 0x00
      total-length: 406
      identification: 0x4d35
      reserved: 0x00
      do-not-fragment: 0x01
      more-fragments: 0x00
      fragment-offset: 0x0000
      time-to-live: 64
      protocol: TCP
      header-checksum: 0x217f
      ip-source: 192 168 0 111
      ip-destination: 151 101 114 49
      pcap/ethernet/ipv4/tcp: 
        port-destination: 37560
        port-source: 80
        sequence-number: 2888627534
        acknowledgement-number: 3300797811
        data-offset: 8
        reserved: 0
        nonce-sum: 0
        congestion-window-reduced: 0
        explicit-congestion-notification: 0
        urgent: 0
        acknowledgement: 1
        push: 1
        reset: 0
        synchronize: 0
        fin: 0
        window-size: 1444
        checksum: 43826
        urgent-pointer: 0
        options: 01 01 08 0a 03 7d ef 61  21 b6 f9 7a                .....}.a  !..z    
        pcap/ethernet/ipv4/tcp/ascii: 
          ascii: GET / HTTP/1.1
                 Host: www.rocketleague.com
                 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0
                 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
                 Accept-Language: en-US,en;q=0.5
                 Accept-Encoding: gzip, deflate
                 DNT: 1
                 Connection: keep-alive
                 Upgrade-Insecure-Requests: 1
                 Cache-Control: max-age=0

Implementation Details

The hannah tool is written in the functional programming language Haskell. Haskell is well-suited for building interpreters and compilers because of its rich type system which supports algebraic data types, polymorphism, and much more.

When invoking the hannah tool, all available specifications are initially parsed. We use the Parsec library that provides several parser combinators that simplify implementing parsers for domain specific languages. The result of the parsing phase is an abstract syntax tree (AST) of all parsed specifications.

Depending on the mode (read, write, modify) with which hannah has been invoked, the AST is interpreted by one or both available interpreters:

Future Work

While the prototypical implementation of hannah is pretty much complete from the point of view of the author, interesting challenges lie ahead. For example, the automatic generation of file parsers for other programming languages seems very promising. Therefore, one would need to implement an interpreter which, instead of parsing or writing a file, prints a program that implements the same behavior as the Read interpreter does, but in a different language, e.g., C or JavaScript. This frees the developer from manually implementing parsers in languages for which such an implementation would be cumbersome and error prone to write. In order to support a new file format, one would just need to write a specification and run the hannah tool with the new interpreter.

Update: Hannah-JS is an experimental client-side parser written in JavaScript which has been generated by hannah.

Specification Language Reference

Types

Name Description Range
uint8 8-bit width unsigned integer 0 to 255
uint16 16-bit width unsigned integer 0 to 65535
uint32 32-bit width unsigned integer 0 to 4294967295
int8 8-bit width signed integer -128 to 127
int16 16-bit width signed integer -32768 to 32767
int32 32-bit width signed integer -2147483648 to 2147483647

Formats

Typed values are printed as decimals by default. If a hexadecimal is more appropriate for a certain value, use hex as FORMAT specifier in the corresponding statement.

Assignments

Values can be assigned to user-defined names using an ASSIGNMENT specifier in the corresponding statements. For example:

[ 0x01 -> "ICMP"
, 0x06 -> "TCP"
, 0x11 -> "UDP"
, 0x29 -> "IPv6" ]

Expressions

The most primitive expressions are constants and names. Both can be combined using

All operators have their usual semantics. Boolean interpretation of integer values happens implicitly as follows:

The file-position keyword evaluates to the number of bytes read from the current file.

Statements