Challenges lexing and parsing MATLAB/Octave

Lexing

On the lexical level MATLAB (R) and Octave are probably the one of the most challenging languages ever invented that aren't weird on purpose (e.g. Malbolge). We posit that a key reason for this is that the actual MATLAB (R) parser combines lexing and parsing in one step; which allows for some constructs that make it impossible to use a dragon-style autogenerated lexer. They are the only languages I know of where the lexing is context sensitive with potentially unlimited lookahead.

The Lexer in MISS_HIT more or less complete and works correctly. The following difficult features/bugs of the MATLAB (R) are fully supported:

Adding anonymous commas in matrices, so that the parser will always see a comma-separated row and does not have to care about whitespace. So [1 1] will be lexed to [1, 1].
Distinguishing ' (transpose) from ' (single-quoted character array)
Support for lambda functions inside matrix/cells (think about the extra special treatment of whitespace in {@(x) 1})
Support for weird matrices (e.g. [1 +1] and [1 ++1] are lexed correctly)
Distinguishing between matrices and assignment lists
Command form transformation (e.g. foo bar (baz) is lexed as identifier foo, char array bar, and char array (baz). Including the bugs/weirdness like f) o[[ b lexing as a single string. Including the really bizarre stuff like f''] % lexing as f] (with a trailing space).
Classdef blocks (e.g. foo uint8 is not lexed as command form, but correctly as two identifiers).
Block comments (with all the unspecified weirdness of nesting them or including text on the same line).

The notable missing features can be found in our lexer issues.

Parsing

Parsing is comparatively easy when compared to lexing. The main issue is that the official documentation does not include a grammar for most language features, so we had to infer one from examples, existing code, and carefully written tests.

The parser is mostly complete, but right now it only parses (no semantic analysis). Three of the most starred projects on github using MATLAB (R) or Octave are processed without parse errors (matconvnet, matlab2tikz, MatlabFunc).

I am working on ironing out some of the more obscure bugs before moving on to basic semantic analysis (starting with name resolution and basic type analysis).

Semantics

The number one issue. The MathWorks has not published semantics for MATLAB, and the documentation is extremely poor for issues surrounding language semantics. For example, the description for division never mentions division by zero.

The plan is to come up with a language subset MISS_HIT that is a safe over-approximation of MATLAB (R), and publish that as the intended sematics. This can be combined with a carfully constructed qualification test-suite that can demonstrate that any given version of MATLAB (R) performs correctly on the MISS_HIT subset.

This approach is more or less what you do when you qualify a tool or compiler for use in your safety critical project.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CHALLENGES.md

CHALLENGES.md

Challenges lexing and parsing MATLAB/Octave

Lexing

Parsing

Semantics

Files

CHALLENGES.md

Latest commit

History

CHALLENGES.md

File metadata and controls

Challenges lexing and parsing MATLAB/Octave

Lexing

Parsing

Semantics