Julien Delange Wednesday, June 29, 2022

What is static code analysis and how does it work?

AUTHOR

Julien Delange, Founder and CEO

Julien is the CEO of Codiga. Before starting Codiga, Julien was a software engineer at Twitter and Amazon Web Services.

Julien has a PhD in computer science from Universite Pierre et Marie Curie in Paris, France.

Static code analysis (or static program analysis) is the process of analyzing computer software without executing it (hence the term "static"). The approach is largely independent of the programming language and computing environment, and it is a common method for detecting security problems and defects in programs written in any programming language. It contrasts with dynamic analysis or testing, which involves executing a program or part of it.

Static code analysis tools power the thousands of code reviews Codiga performs every day. Codiga integrates many tools that support thousands of analysis rules and aggregates their results to deliver analysis results in just a few seconds.

We want to explain the underlying technology. In this blog post, we explain what static code analysis is, how it works, and what the limitations of such an approach are.

How does static code analysis work?

Static Code Analysis involves two major steps:

  1. Transform the code into an Abstract Syntax Tree (AST)
  2. Apply analysis rules to find potential issues

We first explain what an abstract syntax tree is, and then explain the process of static code analysis.

The Abstract Syntax Tree (AST)

An Abstract Syntax Tree (AST) is a way to represent the structure of source code written in a programming language. It makes the code easier for a computer to understand and process. It shows the structure of the code, rather than the syntax or "surface form" that humans typically read. An AST also organizes the elements of a program into categories (such as declarations, statements, and expressions) based on their structure.

As its name suggests, an AST uses a tree structure, where everything is a node. Every node except the root has exactly one parent, and a node has zero or more children. A node has a type that represents a construct from the code, such as an expression or a literal.

For example, the following picture represents the AST of a function in C that takes two arguments (arg1 and arg2) of type int and returns void.

C code

void myfunc(int arg1, int arg2);

Corresponding Abstract Syntax Tree

Example of an Abstract Syntax Tree
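For readers who want to inspect an AST directly, here is a minimal sketch using Python's standard ast module. The Python function myfunc is an illustrative stand-in for the C declaration above, not the tree from the original picture, and the indent argument of ast.dump assumes Python 3.9 or later.

```python
import ast

# Illustrative Python counterpart of the C declaration above.
source = "def myfunc(arg1: int, arg2: int) -> None:\n    pass\n"

tree = ast.parse(source)

# The root is a Module node; its first child is the function definition.
func = tree.body[0]
arg_names = [a.arg for a in func.args.args]
arg_types = [a.annotation.id for a in func.args.args]

print(type(tree).__name__)   # Module
print(type(func).__name__)   # FunctionDef
print(func.name, arg_names, arg_types)

# ast.dump renders the whole tree as nested, typed nodes.
print(ast.dump(tree, indent=2))
```

Each printed node has a type (Module, FunctionDef, arguments, arg, Name) and children, which is the parent/child structure described above.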

Step 1: Parsing your code and transforming it into an Abstract Syntax Tree

Transforming your program into an Abstract Syntax Tree is no easy task. The process consists of parsing the code, interpreting its structure, and building the AST from it. You can write your own parser, use an established parser, or use a framework to generate one (such as ANTLR, a widely used parser generator).

There are challenges when parsing code. First, the code you are trying to parse may not be syntactically correct, which leads to parsing errors (and then, no AST is produced). Some parsers are resilient to parsing errors and attempt to produce an AST based on what can be parsed. Another common issue comes from differences between versions of a language.

As a programming language evolves (and introduces new syntax or keywords), your parser needs to evolve and handle different versions of the language. Otherwise, code that is correct and uses the latest features of a language can be rejected as syntactically incorrect. A good example is type annotations in Python: variable annotations were added in Python 3.6, and code that uses them cannot be processed by a parser that only supports an earlier version of the language.

Step 2: Walking through the Abstract Syntax Tree

Once you have a complete AST, you can analyze it. Static analysis rules inspect the AST and detect potential issues in the code. The analyzer walks through the AST and, when it reaches a node of a type that a rule checks, the rule inspects the node's children and reports an error if it finds one.
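As a rough sketch of such a walk, the snippet below uses ast.NodeVisitor from Python's standard library to visit every node and record the ones of type Call. The class name CallCollector and the sample source are made up for illustration, and ast.unparse assumes Python 3.9 or later.

```python
import ast

source = """
import requests

def fetch(url):
    response = requests.get(url)
    return response.json()
"""

class CallCollector(ast.NodeVisitor):
    """Visits every node and records the ones of type Call."""

    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        # ast.unparse turns the callee sub-tree back into source text.
        self.calls.append((node.lineno, ast.unparse(node.func)))
        # Keep walking so that calls nested inside this one are visited too.
        self.generic_visit(node)

collector = CallCollector()
collector.visit(ast.parse(source))
print(collector.calls)  # [(5, 'requests.get'), (6, 'response.json')]
```

A rule is then a check applied to the nodes collected this way, instead of simply recording them.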

Let's take the example of a rule that analyzes Python code and checks whether calls to the get function of the requests package pass a timeout argument.

This is an example of incorrect code

requests.get(url)

This is an example of correct code

requests.get(url, timeout=2.50)

This is the AST representation of the incorrect code.

AST of the incorrect Python code

The pseudo-code of the rule that checks if the timeout argument is declared would be like this:

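boolean checkNode(node) {
  // Nodes that are not calls to requests.get are not covered by this rule
  if (node.kind != FunctionCall || node.name != "requests.get") {
    return true;
  }
  // The call passes only if one of its arguments is named "timeout"
  for (Argument argument : node.arguments) {
    if (argument.name == "timeout") {
      return true;
    }
  }
  return false;
}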

Applied to the AST shown above, the rule returns false because the function call requests.get has only one argument, named url. If the timeout argument were also passed to the requests.get function call, checkNode would return true.
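The pseudo-code can be turned into a runnable check with Python's ast module. This is a minimal sketch, not the rule Codiga ships: the function name find_missing_timeouts is hypothetical, and it only recognizes calls written literally as requests.get(...).

```python
import ast

def find_missing_timeouts(source):
    """Return the line numbers of requests.get calls without a timeout keyword."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match the attribute access `requests.get` used as the callee.
        is_requests_get = (
            isinstance(func, ast.Attribute)
            and func.attr == "get"
            and isinstance(func.value, ast.Name)
            and func.value.id == "requests"
        )
        # Keyword arguments live in node.keywords; each has a name in .arg.
        if is_requests_get and not any(kw.arg == "timeout" for kw in node.keywords):
            violations.append(node.lineno)
    return sorted(violations)

code = "requests.get(url)\nrequests.get(url, timeout=2.50)\n"
print(find_missing_timeouts(code))  # [1]
```

Only the first line is reported: the second call has a keyword argument named timeout, so it passes the check.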

Step 3: Designing rules

Checking a large codebase for many kinds of errors requires writing a rule for each potential error. This is very time-consuming work. Popular static code analyzers (such as PMD, a great static code analyzer for Java) come with hundreds of pre-configured rules.

Many static code analyzers (such as ESLint) let users extend the tool and define their own rules. They expose standard interfaces that developers follow to design rules. Refer to the documentation of the specific static analysis tool you want to extend for these interfaces.
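To give an idea of what such an interface can look like, here is a small sketch in Python. Everything in it (the Rule base class, the Violation record, the analyze function, and the two sample rules) is hypothetical and does not reproduce the API of ESLint, PMD, or Codiga.

```python
import ast
from dataclasses import dataclass

@dataclass
class Violation:
    rule_id: str
    line: int
    message: str

class Rule:
    """Hypothetical rule interface: one node type to inspect, one check method."""
    rule_id = "base"
    node_type = ast.AST

    def check(self, node):
        """Return an error message, or None when the node is fine."""
        raise NotImplementedError

class NoBareExcept(Rule):
    rule_id = "no-bare-except"
    node_type = ast.ExceptHandler

    def check(self, node):
        # An except clause without an exception type has node.type set to None.
        return "bare except hides errors" if node.type is None else None

class NoPrint(Rule):
    rule_id = "no-print"
    node_type = ast.Call

    def check(self, node):
        is_print = isinstance(node.func, ast.Name) and node.func.id == "print"
        return "print call left in code" if is_print else None

def analyze(source, rules):
    """Walk the AST once and hand each node to the rules registered for its type."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        for rule in rules:
            if isinstance(node, rule.node_type):
                message = rule.check(node)
                if message:
                    violations.append(Violation(rule.rule_id, node.lineno, message))
    return sorted(violations, key=lambda v: (v.line, v.rule_id))

code = "try:\n    print('hi')\nexcept:\n    pass\n"
results = analyze(code, [NoBareExcept(), NoPrint()])
for violation in results:
    print(violation)
```

With this kind of design, adding a rule means adding one small class, and the walking logic stays untouched.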

The Codiga Static Code Analysis engine includes thousands of static analysis rules for 12+ programming languages. You can browse these rules below.

Explore Codiga Static Code Analysis Rules

Limitations of Static Code Analysis

Static code analysis is very powerful and flags issues in different categories (including security, safety, and performance). However, there are still many limitations.

First, writing rules is time-consuming since a new rule needs to be written for every potential issue in each language. Rules are also tool-specific, so they cannot be reused from one analyzer to another, which adds to the development effort.

It is also very common to have false positives (i.e. a rule flags an error when there is nothing wrong with the code). This has been one of the biggest issues with static analyzers: developers need to filter the noise out of all the potential issues reported by static analysis tools. Our static analysis engine has an extra layer to filter false positives, and we also allow users to disable rules for each project.
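The timeout rule from earlier gives a concrete false positive. In the sketch below, the second line of the sample sets a timeout through a dictionary unpacked into the call, but a naive rule still flags it. The rule identifier, the skip_unpacked_kwargs option, and the apply_project_config helper are hypothetical and only illustrate the idea of a filtering layer and of per-project rule disabling; they are not Codiga's implementation.

```python
import ast

RULE_ID = "requests-timeout"  # hypothetical rule identifier

def check_timeouts(source, skip_unpacked_kwargs=False):
    """Report (rule id, line) for requests.get calls without a visible timeout."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and ast.unparse(node.func) == "requests.get"):
            continue
        has_timeout = any(kw.arg == "timeout" for kw in node.keywords)
        # A keyword whose arg is None is a **mapping unpacked into the call.
        has_unpacking = any(kw.arg is None for kw in node.keywords)
        if has_timeout or (skip_unpacked_kwargs and has_unpacking):
            continue
        findings.append((RULE_ID, node.lineno))
    return sorted(findings)

def apply_project_config(findings, disabled_rules):
    """Drop findings for rules the project has switched off."""
    return [f for f in findings if f[0] not in disabled_rules]

code = (
    "options = {'timeout': 2.5}\n"
    "requests.get(url, **options)\n"  # timeout is set, but only through **options
    "requests.get(url)\n"             # timeout is really missing
)

naive = check_timeouts(code)
filtered = check_timeouts(code, skip_unpacked_kwargs=True)
silenced = apply_project_config(naive, disabled_rules={RULE_ID})
print(naive)     # [('requests-timeout', 2), ('requests-timeout', 3)]
print(filtered)  # [('requests-timeout', 3)]
print(silenced)  # []
```

The filter trades one problem for another: skipping every call that unpacks a mapping removes the false positive on line 2, but it would also hide a call whose mapping does not contain a timeout.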

Last, static analysis tools cannot detect issues that depend on runtime behavior. An issue that occurs only in a specific runtime environment cannot be detected. Similarly, for languages that have undefined behavior (such as C++), static analysis tools cannot determine precisely whether a problem will occur.
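The timeout example shows where this limit sits. In the sketch below (the function describe_timeout and the get_timeout call inside the sample are hypothetical), the AST tells us whether the timeout is a literal written in the code or an expression whose value only exists once the program runs.

```python
import ast

def describe_timeout(call_source):
    """Classify the timeout argument of a single call expression."""
    call = ast.parse(call_source, mode="eval").body
    for kw in call.keywords:
        if kw.arg == "timeout":
            if isinstance(kw.value, ast.Constant):
                # A literal: its value is visible in the source code.
                return f"known statically: {kw.value.value!r}"
            # Any other expression is only evaluated when the program runs.
            return f"unknown until runtime: {ast.unparse(kw.value)}"
    return "missing"

print(describe_timeout("requests.get(url, timeout=2.5)"))
print(describe_timeout("requests.get(url, timeout=get_timeout())"))
print(describe_timeout("requests.get(url)"))
```

For the second call, a rule that only looks at the AST sees that a timeout argument exists, but it cannot tell whether get_timeout() returns a sensible value or None.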

Wrapping Up

Static code analyzers are very powerful tools and catch a lot of issues in source code. They help developers catch errors in their code every single day, and they prevent unsafe or insecure code from being shipped to production.

Writing a static analysis tool is a hard and time-consuming task. Developers need to write many rules to check for code correctness, and such rules can still trigger false positives. Fortunately, existing static code analyzers are very extensible, and instead of writing a tool from scratch, you can add your own rules to existing tools.
