module analysis::m3::Core

rascal-0.40.16

The M3 common source code model represents facts extracted from source code for use in downstream metrics or other analyses.

Usage

import analysis::m3::Core;

Dependencies

import Message;
import Set;
import IO;
import util::FileSystem;
import analysis::graphs::Graph;
import Node;
import Map;
import List;
import Relation;
extend analysis::m3::TypeSymbol;

Description

The M3 Core defines basic concepts such as:

  • qualified names: we use locations to model qualified names for each programming language
  • containment: which artifacts are contained in which other artifacts
  • declarations: where artifacts are defined
  • uses: where declared artifacts are used
  • types: which artifacts have which types

This Core is meant to be extended with features specific to a programming language. See for example Java M3.

Benefits

  • Qualified names in the shape of a location are a uniform and generic way of identifying source code artifacts, that can be extended across languages, projects, and versions.
  • M3 helps standardize the shape of the facts we extract from source code across different languages, limiting the element of surprise.
  • When we use M3 for many languages, common IDE features are made reusable (such as clicking from an extracted fact to the code that generated it).
  • Some downstream analyses may be reusable between different languages if they all map to M3.

Pitfalls

  • Even though different languages may map to the same M3 model, this does not mean that the semantics is the same. Downstream metrics or other analysis tools should still take semantic differences between programming languages into account.

data M3

An M3 model is a composable database of ground-truth facts about a specific set of source code artifacts.

data M3 (
set[Language] languages = {},
rel[loc name, loc src] declarations = {},
set[loc] implicitDeclarations = {},
rel[loc name, TypeSymbol typ] types = {},
rel[loc src, loc name] uses = {},
rel[loc from, loc to] containment = {},
list[Message] messages = [],
rel[str simpleName, loc qualifiedName] names = {},
rel[loc definition, loc comments] documentation = {},
rel[loc definition, Modifier modifier] modifiers = {}
)
= m3(loc id)
;

This m3 data constructor holds all information of an M3 model. It is identified by the id field, which should be a unique name for the project, file, or composition that the M3 model was constructed for.

Practically all relations in an M3 model relate source locations of the loc type:

  1. Name locations are logical locations that represent fully qualified names of declared artefacts.
    • For example: |java+method:///java/util/List/toString()|
    • Name locations are always indicated with the column name name in any relation below.
  2. Source locations are physical locations that point to (an exact part of) a source code file:
    • For example: |project://jre13/src/main/java/java/util/List.java|(100,350,<20,0>,<25,10>)
    • Source locations are always indicated with the column name src in any relation below.
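
As a minimal sketch of how these two kinds of locations end up in a model, the following builds a tiny M3 value by hand. The module name, the demo project, every location, and the function name tinyModel are made up for illustration; real models are normally produced by a language front-end rather than written out like this.

```rascal
module TinyModel

import analysis::m3::Core;

// Logical name locations: hypothetical fully qualified names.
loc classA  = |java+class:///demo/A|;
loc methodF = |java+method:///demo/A/f()|;

// Physical source locations: hypothetical file, offsets and line/column pairs.
loc classASrc  = |project://demo/src/demo/A.java|(0,120,<1,0>,<9,1>);
loc methodFSrc = |project://demo/src/demo/A.java|(20,60,<3,2>,<6,3>);
loc callOfF    = |project://demo/src/demo/A.java|(90,3,<8,4>,<8,7>);

// A hand-written model; the id names the file the facts were taken from.
M3 tinyModel() = m3(|project://demo/src/demo/A.java|,
    declarations = {<classA, classASrc>, <methodF, methodFSrc>},
    uses         = {<callOfF, methodF>},
    containment  = {<classA, methodF>},
    names        = {<"A", classA>, <"f", methodF>}
);
```

Relations that are not mentioned, such as types or modifiers, keep their empty defaults.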

These are the core facts stored in M3 models because 90% of all programming languages have these core features:

Ground truth fact kind about source code | Description
set[Language] languages | describes the languages this model contains information about, including their version numbers for the sake of transparency
rel[loc name, loc src] declarations | maps qualified names of declared artefacts to their original source location in the current model, if any.
rel[loc src, loc name] uses | as the inverse of declarations this maps every source location where a declared artefact is used to its fully qualified name.
set[loc] implicitDeclarations | provides a set of qualified names of things that are present no matter what in a programming language, for completeness sake.
rel[loc from, loc to] containment | links the qualified name of the outer (from) declaration to the names of everything that is declared inside of it (to).
rel[loc name, TypeSymbol typ] types | akin to the classical symbol table, this relation maps fully qualified names to a TypeSymbol representation of their static type.
rel[str simpleName, loc qualifiedName] names | is for producing human/user readable messages about declared artefacts; every fully qualified name {c,sh,w}ould have one.
list[Message] messages | collects the errors and warnings produced by the parser/compiler that populated this model.
rel[loc definition, loc comments] documentation | links documentation strings (comments) inside the source code to specific declarations. A typical example would be JavaDoc comments to a class definition.
rel[loc definition, Modifier modifier] modifiers | links modifiers to fully qualified declarations (typically access modifiers like public or private or storage modifiers such as static)

More relations would be added by M3 model builders for specific programming paradigms.

Benefits

  • Logical name locations are both readable and optimally accurate references to specific source code artefacts. No accidental confusion by mixing namespaces.
  • Binary relations on locations are easily composed to infer new and interesting facts.
    • In particular the composition operator and comprehensions can be used to easily deduce or infer more facts;
    • Composing declarations o uses immediately generates a detailed dependency graph
    • Composing uses o declarations immediately produces a jump-to-definition graph, while its inverse (uses o declarations)<1,0> produces a references graph.
  • M3 models never use maps because those are not safely compositional (one map could overwrite the facts of another).
  • Specific programming paradigms and languages may add new facts to the M3 relation.
    • For Java and C++ there would be class extension and interface implementation relations, for example.
    • PHP would add a relation to link classes to traits, etc. etc.
  • Every relation, set, or list of facts in an M3 model is composable by union or concatenation. This makes an entire model composable by composing its items one by one. The Compose M3 function implements such a union.
    • Composition can be used to easily construct project-level models from file-level models.
    • Composition can be used to simulate (dynamic) linkage between projects.
    • Composition can be used to start simulating remote-procedure calls and shared memory, and other inter-programming language composition like JNI.
  • M3 models can be cached (efficiently) on disk using functions from Value IO. A single stored M3 model simulates an object file, while a composed M3 model is more like an .a archive or a .jar archive.
    • Integrating M3 model caching during a build process (e.g. ANT, Makefiles or Maven) is a smart way to make whole program analysis fast and incremental.
    • Integrating M3 model caching in Integrated Development Environments (e.g. via the Language Server Protocol) enables fast and incremental IDE features based on the whole program indexing that M3 provides.
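
The composition and caching points above can be sketched as follows. This is only an illustration: the module name and the function names jumpToDefinition, references and roundTrip are hypothetical, and the cache location is left to the caller. The only library calls used are writeBinaryValueFile and readBinaryValueFile from ValueIO.

```rascal
module M3Queries

import analysis::m3::Core;
import ValueIO;

// From every use site to the source location of the declaration it refers to.
rel[loc, loc] jumpToDefinition(M3 m) = m.uses o m.declarations;

// The inverse: from every declaration site to all places that reference it.
rel[loc, loc] references(M3 m) = (m.uses o m.declarations)<1,0>;

// Store a model on disk and read it back from the given cache location.
M3 roundTrip(M3 m, loc cache) {
    writeBinaryValueFile(cache, m);
    return readBinaryValueFile(#M3, cache);
}
```

A build step could call roundTrip-like code once per file and later read the stored models back instead of re-running the front-end; where the cache files live (for example under a temporary or project-local location) is a choice of the surrounding tooling.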

Pitfalls

  • Initial M3 models should not contain inferred information, only ground truth data as extracted from parse trees or abstract syntax trees, and facts from the static name and type resolution stages of a compiler or interpreter.
    • Inference is certainly possible (say to construct an over-approximated call graph), but that is not what we call an M3 model.
    • The reason is that metrics of over- and under-approximated abstract interpretations of programs quickly lose their tractability and understandability. Also, in (the education of) empirical scientific methods it is of grave importance to separate facts from heuristic inference.
  • Simply calling Compose M3 does not immediately represent the full static semantics of program composition: what the union of facts, as implemented by Compose M3, means depends on the semantics of the programming language. Sometimes new connections between the merged models must be added programmatically to complete the composition. Such analyses are static simulations of the linking and loading stages of programming languages. When we simulate static composition, these analyses are ground truth, but when we simulate dynamic loading we have to treat the results as heuristic inferences.
  • Not every programming language front-end that creates M3 models has to have implemented all the above relations (yet). Constructing such a front-end may take time and incrementally growing models can already be very useful.
  • Even though M3 models can have errors and be partially populated, please be aware that partially correct programs lead to partially correct models and all downstream analysis is correspondingly inaccurate.
  • In statically typed programming languages the declarations relation is typically one-to-one and the uses relation is many-to-one, which means that name resolution is unique at compile-time. However, this is not required for other, more dynamic languages, and this is fine. You will see that one qualified name could potentially resolve to different artefacts at run-time. This is reflected by the uses relation then containing many-to-many tuples. Be careful how you count, for example, dependencies or coupling in such cases, since we are literally already over-approximating the reality of the running program.
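
Before counting dependencies or coupling, it can help to check whether a model actually has the one-to-one and many-to-one shape described in the last pitfall. The following is a minimal sketch; the module name and the function names multiplyDeclared and ambiguousUses are hypothetical helpers, not part of the library.

```rascal
module M3Ambiguity

import analysis::m3::Core;
import Relation;
import Set;

// Qualified names that are declared at more than one source location.
set[loc] multiplyDeclared(M3 m)
    = { name | loc name <- domain(m.declarations), size(m.declarations[name]) > 1 };

// Use sites that resolve to more than one qualified name.
set[loc] ambiguousUses(M3 m)
    = { src | loc src <- domain(m.uses), size(m.uses[src]) > 1 };
```

If both sets are empty, every name has a single declaration site and every use resolves uniquely; otherwise any count based on uses should be read as an over-approximation.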

data Language

Extensible data-type to define language names and their versions.

data Language (str version = "") 
= generic()
;

Most ground truth facts about source code require analysis tooling that is specific to the language:

  • parsers
  • name analysis
  • type analysis

However, there are language analysis methods that are language agnostic such as counting lines of code. For this we have the generic() language name.
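
A minimal sketch of how a language-specific module might add its own language name and record it in a model. The module name, the constructor toy() and both function names are hypothetical; the sketch uses extend rather than import so that the new constructor joins the Language data-type together with its version keyword field.

```rascal
module ToyLanguage

extend analysis::m3::Core;

// Hypothetical language name; `toy` is not part of the library.
data Language = toy();

// A model that records which language (and version) its facts come from.
M3 emptyToyModel(loc id) = m3(id, languages = {toy(version = "1.2")});

// A model for a language-agnostic analysis, such as counting lines of code.
M3 emptyGenericModel(loc id) = m3(id, languages = {generic()});
```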

function composeM3

Generic function to compose the facts of a set of M3s into a single model.

M3 composeM3(loc id, set[M3] models)

We iterate over all the facts stored in every model, and use set union or list concatenation to collect the elements of all relations and lists.
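
As an illustration, the sketch below composes two hand-written file-level models into one project-level model. The module name, the demo project, all locations and the function names are hypothetical.

```rascal
module ComposeDemo

import analysis::m3::Core;

// Two hypothetical file-level models: A.java declares class A, B.java uses it.
M3 fileA() = m3(|project://demo/src/A.java|,
    declarations = {<|java+class:///demo/A|, |project://demo/src/A.java|(0,40,<1,0>,<3,1>)>});

M3 fileB() = m3(|project://demo/src/B.java|,
    declarations = {<|java+class:///demo/B|, |project://demo/src/B.java|(0,60,<1,0>,<4,1>)>},
    uses         = {<|project://demo/src/B.java|(20,1,<2,4>,<2,5>), |java+class:///demo/A|>});

// One project-level model that holds the union of the facts of both files.
M3 demoProject() = composeM3(|project://demo/|, {fileA(), fileB()});
```

In the composed model the use of class A in B.java and the declaration of class A in A.java sit in the same relations, so a query that joins uses with declarations can link the two files.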

Benefits

  • Composition satisfies the requirements for many downstream analyses:
    • Composition can be used to easily construct project-level models from file-level models, e.g. for open-source project analysis.
    • Composition can be used to simulate (dynamic) linkage between projects, e.g. for whole-program analysis.
    • Composition can be used to start simulating remote-procedure calls and shared memory, and other inter-programming language composition like JNI.
  • Transitive closure on composed models leads to effective (and fast) reachability analysis.
  • This function is rather memory-efficient: it iterates over the keyword parameter sets and lists that are already in memory, and splices the unions into an incrementally (transiently) growing set or list via a comprehension. This avoids a lot of copying and intermediate memory allocation, which can be detrimental when doing large whole program analyses.

Pitfalls

  • If the quality of the qualified names in the original models is lacking, then this is the moment that different declarations might be conflated under the same fully qualified name. All downstream analysis is broken then.
  • This function does not compose the extended facts for specific programming languages yet.
  • If extended M3 models use something other than sets, lists or relations, this composition function ignores them completely.
  • Composed models can be huge in memory. Make sure to allocate enough heap for the JVM. Real-world programs from real-world products can take gigabytes of memory, even when compressed and optimized as M3 models.

function diffM3

Generic function to apply a difference over the annotations of a list of M3s.

M3 diffM3(loc id, list[M3] models)

function modifyM3

M3 modifyM3(loc id, list[M3] models, value (&T,&T) fun)

function isEmpty

bool isEmpty(M3 model)

function files

set[loc] files(M3 model)

function containmentToFileSystem

Transform the containment relation to a recursive tree model.

set[FileSystem] containmentToFileSystem(M3 model)

This makes the containment relation into an abstract File System for further analysis, or visualization.

Benefits

  • Transforming the containment relation to a tree model allows further analysis using operators such as visit and descendant matching (/) which is sometimes more convenient.
  • The tree shape is better for visualization purposes.
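
A minimal sketch of the kind of traversal this enables, using descendant matching and visit on the resulting trees. The module name and the function names allEntities and countNodes are hypothetical; the code assumes only that every contained entity appears as a node of the returned FileSystem trees.

```rascal
module ContainmentTree

import analysis::m3::Core;
import util::FileSystem;

// All locations mentioned anywhere in the containment trees of a model.
set[loc] allEntities(M3 m) = { l | /loc l := containmentToFileSystem(m) };

// Number of tree nodes, counted with a visit over the same trees.
int countNodes(M3 m) {
    int n = 0;
    visit (containmentToFileSystem(m)) {
        case FileSystem _: n += 1;
    }
    return n;
}
```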

function checkM3

list[Message] checkM3(M3 model)

function m3SpecificationTest

Specification to test the quality of M3 models that specific language front-ends produce.

bool m3SpecificationTest(M3 m, bool closedWorld=false, bool covering=false)

Based on the language agnostic relations in the M3 model, this function tries to validate the internal consistency of an M3 model.

If an M3 instance is a closedWorld model, this means that there are no uses in the model that are not declared in the current model in declarations. A closed world model allows for more stringent consistency checks than a model that depends on external declarations. By selecting closedWorld=true those additional checks are enabled, otherwise these are ignored or weakened accordingly. It is advisable to provide at least one closed model per programming language front-end when testing against this spec.

By covering we mean that everything that is declared in the model is also used at least once. This is a simple check for knowing if the test covers the language in some form.
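
The two flags can be sketched in use as follows. The module name and the wrapper names checkClosedModel and checkOpenCoveringModel are hypothetical; a front-end test suite would pass in models produced from its own test programs.

```rascal
module FrontEndSpec

import analysis::m3::Core;

// Strict check for a model whose uses all resolve inside the model itself.
bool checkClosedModel(M3 model) = m3SpecificationTest(model, closedWorld = true);

// Weaker open-world check that additionally requires every declaration
// in the model to be used at least once.
bool checkOpenCoveringModel(M3 model) = m3SpecificationTest(model, covering = true);
```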

Benefits

  • Front-end construction is tricky business. This test provides a sanity check before users start depending on faulty models.

Pitfalls

  • In closedWorld many things can be strictly checked, but in an open world with dependencies outside of the current model the validation is much weaker.