uast2clickhouse

A CLI tool to push Babelfish UASTs into a ClickHouse database.

It is written in Go and has zero dependencies. The list of solved problems includes:

  • Normalizing the UAST even more aggressively than Babelfish's Semantic mode.
  • Converting the tree structure to a linear list of "interesting" nodes (see the sketch below).
  • Handling runtime errors typical of big data processing: OOMs, crashes, DB insertion failures, etc.
  • Running distributed and unattended.
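
To make the second bullet concrete, here is a minimal, hypothetical sketch of flattening a tree into rows. The Node and Row structs and the isInteresting predicate are illustrative stand-ins, not the tool's actual types:

package main

import "fmt"

// Node is an illustrative stand-in for a parsed UAST node.
type Node struct {
	ID       int32
	Type     string
	Value    string
	Children []*Node
}

// Row is what ends up as one record in the uasts table (simplified).
type Row struct {
	ID      int32
	Parents []int32
	Type    string
	Value   string
}

// flatten walks the tree rooted at n depth-first and appends one Row per
// interesting node, carrying the chain of parent IDs along the way.
func flatten(n *Node, parents []int32, out *[]Row) {
	if n == nil {
		return
	}
	if isInteresting(n) {
		*out = append(*out, Row{
			ID:      n.ID,
			Parents: append([]int32(nil), parents...),
			Type:    n.Type,
			Value:   n.Value,
		})
	}
	childParents := append(append([]int32(nil), parents...), n.ID)
	for _, c := range n.Children {
		flatten(c, childParents, out)
	}
}

// isInteresting is a placeholder standing in for the tool's notion of
// "interesting" nodes.
func isInteresting(n *Node) bool {
	return n.Value != ""
}

func main() {
	root := &Node{ID: 1, Type: "uast:File", Children: []*Node{
		{ID: 2, Type: "uast:Identifier", Value: "foo"},
		{ID: 3, Type: "uast:FunctionGroup", Children: []*Node{
			{ID: 4, Type: "uast:Identifier", Value: "bar"},
		}},
	}}
	var rows []Row
	flatten(root, nil, &rows)
	fmt.Println(rows) // two rows: one for foo with parents [1], one for bar with parents [1 3]
}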

Installation

You need a Go compiler >=1.11.

export GO111MODULE=on
go build uast2clickhouse

Usage

Install ClickHouse >= 19.4 and initialize the DB schema:

clickhouse-client --query="CREATE TABLE uasts (
  id Int32,
  left Int32,
  right Int32,
  repo String,
  lang String,
  file String,
  line Int32,
  parents Array(Int32),
  pkey String,
  roles Array(Int16),
  type String,
  orig_type String,
  uptypes Array(String),
  value String
) ENGINE = MergeTree() ORDER BY (repo, file, id);

CREATE TABLE meta (
   repo String,
   siva_filenames Array(String),
   file_count Int32,
   langs Array(String),
   langs_bytes_count Array(UInt32),
   langs_lines_count Array(UInt32),
   langs_files_count Array(UInt32),
   commits_count Int32,
   branches_count Int32,
   forks_count Int32,
   empty_lines_count Array(UInt32),
   code_lines_count Array(UInt32),
   comment_lines_count Array(UInt32),
   license_names Array(String),
   license_confidences Array(Float32),
   stars Int32,
   size Int64,
   INDEX stars stars TYPE minmax GRANULARITY 1
) ENGINE = MergeTree() ORDER BY repo;"
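
For reference, a row of the uasts table corresponds roughly to the following Go struct. This is only an illustration derived from the schema above, and the comments are interpretations rather than authoritative documentation:

package main

import "fmt"

// UASTRow mirrors one row of the uasts table, for reference only.
type UASTRow struct {
	ID       int32    // id
	Left     int32    // left (presumably the node's left position bound)
	Right    int32    // right (presumably the right position bound)
	Repo     string   // repo: repository name
	Lang     string   // lang: language detected by Babelfish
	File     string   // file: path inside the repository
	Line     int32    // line
	Parents  []int32  // parents: IDs of the ancestor nodes
	PKey     string   // pkey
	Roles    []int16  // roles: Babelfish role IDs
	Type     string   // type: normalized node type
	OrigType string   // orig_type: original language-specific node type
	UpTypes  []string // uptypes (possibly the types merged during normalization -- an assumption)
	Value    string   // value: token value
}

func main() {
	row := UASTRow{Repo: "myrepo", File: "main.go", Lang: "Go", Type: "uast:Identifier", Value: "foo"}
	fmt.Printf("%+v\n", row)
}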

Then run on each of the worker nodes:

./uast2clickhouse --heads heads.csv --db default:[email protected]/default /path/to/parquet

or

./uast2clickhouse --heads heads.csv --db default:[email protected]/default 10.150.0.9:11300

heads.csv contains the mapping from the HEAD UUIDs in the Parquet files to the actual repository names. If you work with PGA, download it or generate it with list-pga-heads. --db default:[email protected]/default is the ClickHouse connection string. 10.150.0.9:11300 is a sample beanstalkd message queue address for distributed processing. To reach peak performance, specify --read-streams and --db-streams: --read-streams sets the number of goroutines that read the Parquet file, and --db-streams sets the number of HTTP threads that upload the SQL insertions to ClickHouse. Usually --db-streams is bigger than --read-streams. Bigger values increase memory pressure.
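
The following is a minimal sketch of the --db-streams idea, assuming ClickHouse's standard HTTP interface on port 8123: several goroutines each POST batched INSERT statements. It illustrates the pattern only and is not uast2clickhouse's actual code:

package main

import (
	"log"
	"net/http"
	"strings"
	"sync"
)

func main() {
	const dbStreams = 6 // corresponds to --db-streams
	// ClickHouse's HTTP interface; the column subset is only for the example.
	const url = "http://127.0.0.1:8123/?query=" +
		"INSERT%20INTO%20uasts%20(repo,file,id,type,value)%20FORMAT%20TabSeparated"

	batches := make(chan string) // each batch is a chunk of TabSeparated rows
	var wg sync.WaitGroup
	for i := 0; i < dbStreams; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range batches {
				resp, err := http.Post(url, "text/plain", strings.NewReader(batch))
				if err != nil {
					log.Println("insert failed:", err)
					continue
				}
				if resp.StatusCode != http.StatusOK {
					log.Println("insert failed:", resp.Status)
				}
				resp.Body.Close()
			}
		}()
	}

	// The --read-streams side would parse Parquet files and feed batches here.
	batches <- "myrepo\tmain.go\t1\tuast:Identifier\tfoo\n"
	close(batches)
	wg.Wait()
}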

Sample operation

Input: UASTs extracted from PGA'19, 204,068 Parquet files overall on a 6 TB Google Cloud volume. DB instance configuration: Google Cloud "highcpu" with 64 cores and 58 GB of RAM; 6 local NVMe SSDs joined in RAID0 and formatted as ext4 with the journal disabled; Ubuntu 18.04. Worker configuration: Ubuntu 18.04 with a 20 GB SSD disk and the UASTs volume attached read-only at /mnt/uasts.

  • Install and run beanstalkd on the DB instance. Build the beanstool binary locally and scp it there.
  • List all the Parquet files with find /mnt/uasts -name '*.parquet' | gzip > tasks.gz on one of the workers.
  • scp tasks.gz to the DB instance and fill the queue with zcat tasks.gz | xargs -n1 ./beanstool put --ttr 1000h -t default -b (a sketch of the worker side of this queue appears after these steps).
  • Install and set up ClickHouse on the DB instance. Sample /etc/clickhouse-server/config.xml and /etc/clickhouse-server/users.xml are provided.
  • Execute the pushing procedure in 4 stages:
  1. 16 workers, 2 cores, 4 GB RAM each. ./uast2clickhouse --read-streams 2 --db-streams 6 --heads heads.csv --db default:[email protected]/default 10.150.0.9:11300. This succeeds for ~80% of the tasks. Then run ./beanstool kick --num NNN -t default to re-queue the remaining tasks.
  2. 16 workers, 2 cores, 4 GB RAM each. ./uast2clickhouse --read-streams 1 --db-streams 1 --heads heads.csv --db default:[email protected]/default 10.150.0.9:11300. This succeeds for all but ~1k tasks.
  3. 16 workers, 2 cores, 16 GB RAM each ("highmem"). Same command. This leaves only ~10 tasks.
  4. 2 workers, 4 cores, 32 GB RAM each ("highmem"). Same command, full success.
  • Create the secondary DB indexes:
SET allow_experimental_data_skipping_indices = 1;
ALTER TABLE uasts ADD INDEX lang lang TYPE set(0) GRANULARITY 1;
ALTER TABLE uasts ADD INDEX type type TYPE set(0) GRANULARITY 1;
ALTER TABLE uasts ADD INDEX value_exact value TYPE bloom_filter() GRANULARITY 1;
ALTER TABLE uasts ADD INDEX left (repo, file, left) TYPE minmax GRANULARITY 1;
ALTER TABLE uasts ADD INDEX right (repo, file, right) TYPE minmax GRANULARITY 1;
ALTER TABLE uasts MATERIALIZE INDEX lang;
ALTER TABLE uasts MATERIALIZE INDEX type;
ALTER TABLE uasts MATERIALIZE INDEX value_exact;
ALTER TABLE uasts MATERIALIZE INDEX left;
ALTER TABLE uasts MATERIALIZE INDEX right;
OPTIMIZE TABLE uasts FINAL;
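
As referenced in the queue-filling step above, here is a rough sketch of the worker side of the beanstalkd queue, assuming the github.com/beanstalkd/go-beanstalk client: reserve a Parquet path, process it, delete the job on success, and bury it on failure so it can later be kicked back with beanstool kick. The process function is a placeholder, not the tool's real pipeline:

package main

import (
	"log"
	"time"

	"github.com/beanstalkd/go-beanstalk"
)

func main() {
	conn, err := beanstalk.Dial("tcp", "10.150.0.9:11300")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	for {
		id, body, err := conn.Reserve(5 * time.Second)
		if err != nil {
			break // the queue is drained or the connection is lost
		}
		path := string(body) // a Parquet path under /mnt/uasts pushed by beanstool
		if err := process(path); err != nil {
			log.Println("failed, burying:", path, err)
			conn.Bury(id, 0)
			continue
		}
		conn.Delete(id)
	}
}

// process stands in for parsing the Parquet file and pushing rows to ClickHouse.
func process(path string) error {
	log.Println("processing", path)
	return nil
}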

The whole thing takes ~1 week.

Tests

There are sadly no tests at the moment. We are going to fix this.

License

Apache 2.0, see LICENSE.

uast2clickhouse's People

Contributors

r0maink, vmarkovtsev


uast2clickhouse's Issues

Report on discrepancies and bugs

I parsed a small subset of Parquet files and analyzed the resulting DB to see how we could improve the extraction. This is by no means exhaustive, but it already covers at least some of the generic things we should expect. I will go through the languages I analyzed, list my observations, and then propose possible improvements.

Observations

Go

  • Imports are doubled. This is a problem stemming from the UAST structure, as the list of import nodes appears twice: once in the Imports attribute of the root node, and a second time nested in the Decls nodes whose Tok attribute has the value Import
  • Literals are stored with type go:BasicLit
  • When declaring a variable with var my_var my_var_type, my_var has type go:ValueSpec

Python

  • The keywords return, assert, if, else and elif are kept and stored with distinct python:... types
  • Operators all have distinct python:... types, possibly several per operator in some cases (I found python:Sub and python:USub, both for the operator -)
  • Numbers are stored with type python:Num, and None has its own type as well: python:NoneLiteral
  • When using class my_class, my_class is stored with type python:ClassDef

C++

  • Comments written before the imports are not present in the UAST, so they are not present in the DB either. I'm guessing this is a Babelfish bug (or feature).
  • Literals are stored with type cpp:CPPASTLiteralExpression
  • Operators are stored with type cpp:CPPASTBinaryExpression
  • Types are kept and stored with type cpp:CPPASTSimpleDeclSpecifier or uast:Identifier if used to declare a variable (except void and auto)
  • I found issues with the line numbers, probably linked to newlines, but it's pretty random. However, the order is coherent and tokens on the same line share the same line number, so in most if not all applications it should not affect us.
  • When the phrase using something is used, we get 2 rows, one for something and a second one which always has type cpp:CPPASTUsingDeclaration and value ICPPASTUsingDeclaration.NAME - Qualified Name brought into scope.
  • When the syntax some_var::operator some_op is used, we get a lot of duplication, including nodes with type cpp:CPPASTQualifiedName containing this whole string. I will not go into too much detail as it varied a lot, but for instance:
// the line below yields 46 rows
return array_t::operator[](x + y * length() + z * area());
// instead of (5 identifiers + 4 operators + 1 "operator []") = 10 rows
  • When the phrase extern "C" is used, we get a row with value C and type cpp:CPPASTLinkageSpecification
  • When defining macros with the #define SOME_THING() ...; syntax, the nodes are stored with type cpp:ASTFunctionStyleMacroDefinition; when used later they pop up either with the uast:Identifier type or, more rarely, with cpp:ASTUndef
  • When just using #define SOME_THING, the node is stored with type cpp:ASTMacroDefinition

Java

  • Booleans are stored with type java:BooleanLiteral
  • if and else are stored with type java:If/ElseStatement
  • Keywords like protected, public, private, static or final are kept and stored with type java:Modifier
  • Numbers are stored with type java:NumberLiteral and null has its own type: java:NullLiteral
  • The package keyword is kept and stored with type java:PackageDeclaration
  • Primitive types are kept and stored with type java:PrimitiveType

C#

  • Types all have their own node type, e.g. csharp:IntKeyword or csharp:ByteKeyword
  • Keywords all have their own type as well, e.g. csharp:CatchKeyword or csharp:ClassKeyword
  • Operators all have their own type as well, e.g. csharp:ClassKeyword for * or csharp:ExclamationEqualsToken for !=
  • Unlike all other languages I've seen so far, punctuation is kept, each token with its own type, e.g. csharp:CloseBraceToken
  • Numbers are stored with type csharp:NumericLiteralToken and null has its own type: csharp:NullKeyword

Possible improvements

  • De-duplicating Go imports: this could be done at several points and would remove half of the import rows
  • Filter out nodes with type cpp:CPPASTUsingDeclaration
  • Look into the cpp:CPPASTQualifiedName issue and resolve the duplication. Ideally, for token::operator [] we should have one row with value token and type uast:Identifier and one row with value [] and type uast:Operator if we want to normalize, or something more specific if we want to keep the info.
  • Filter out nodes that represent C# punctuation
  • Map numeric literals to a custom uast:NumericLiteral type (see the sketch after this list)
  • Map language-specific booleans to the existing uast:Bool
  • Map language-specific operators to the existing uast:Operator
  • Map language-specific None/Null types to a custom uast:Null
  • Filter out language-specific keywords, unless you think there is some value in keeping them
  • Filter out types, unless you think there is some value in keeping them
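
As a hypothetical sketch of the mapping proposals above, a small lookup table could rewrite language-specific literal, boolean, and null types into shared uast:* types before insertion. The map contents come from the observations in this report; uast:NumericLiteral and uast:Null are proposed names, not existing Babelfish types:

package main

import "fmt"

var typeMap = map[string]string{
	// numeric literals -> proposed uast:NumericLiteral
	"python:Num":                 "uast:NumericLiteral",
	"java:NumberLiteral":         "uast:NumericLiteral",
	"csharp:NumericLiteralToken": "uast:NumericLiteral",
	// booleans -> existing uast:Bool
	"java:BooleanLiteral": "uast:Bool",
	// null-like values -> proposed uast:Null
	"python:NoneLiteral": "uast:Null",
	"java:NullLiteral":   "uast:Null",
	"csharp:NullKeyword": "uast:Null",
}

// normalizeType returns the shared type if a mapping exists, otherwise the original.
func normalizeType(origType string) string {
	if t, ok := typeMap[origType]; ok {
		return t
	}
	return origType
}

func main() {
	fmt.Println(normalizeType("python:Num"))  // uast:NumericLiteral
	fmt.Println(normalizeType("go:BasicLit")) // unchanged
}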

Fix how we handle the Names attribute

Currently, when a node is being traversed, we check multiple attributes to find its value. To do so, we rely on the assumption that there is at most one value, and check, successively:

  1. The Name/name field.
  2. The Text/text field.
  3. The Value/value field.
  4. Finally, the Names attribute. We handle this one differently: if the field is an array of nodes (which it always is), we join the values of these nodes' Name attributes, when present. A simplified sketch follows this list.
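
Here is the promised simplified sketch of that lookup order, with a node represented as a plain map instead of a real Babelfish node object:

package main

import (
	"fmt"
	"strings"
)

// node is a simplified stand-in for a Babelfish UAST node.
type node map[string]interface{}

// extractValue follows the order described above: Name/name, Text/text,
// Value/value, then the joined Names.
func extractValue(n node) string {
	for _, key := range []string{"Name", "name", "Text", "text", "Value", "value"} {
		if v, ok := n[key].(string); ok && v != "" {
			return v
		}
	}
	if names, ok := n["Names"].([]node); ok {
		var parts []string
		for _, child := range names {
			if v, ok := child["Name"].(string); ok && v != "" {
				parts = append(parts, v)
			}
			// A uast:Alias child stores Identifier nodes, not plain strings,
			// so it contributes nothing here -- the problem described below.
		}
		return strings.Join(parts, " ") // separator is illustrative
	}
	return ""
}

func main() {
	// from foo import bar as baz -- only the Names part of uast:RuntimeImport.
	alias := node{
		"@type": "uast:Alias",
		"Name":  node{"@type": "uast:Identifier", "Name": "baz"},
		"Node":  node{"@type": "uast:Identifier", "Name": "bar"},
	}
	imp := node{"@type": "uast:RuntimeImport", "Names": []node{alias}}
	fmt.Printf("%q\n", extractValue(imp)) // "" -- both bar and baz are lost
}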

To avoid duplication, we then ignore the Names attribute in the goDeeper function. While this provides utility, it carries 2 risks, one of which I am certain exists:

  1. If a node has a value held in one of the 3 first attributes and a non-empty Names attribute, we will not traverse the nodes in it. I have not (yet) found an example of this, but it is a possibility.
  2. If Names contains nodes that are nested, and thus do not have a Name attribute themselves. This is the case in the following example:
from foo import bar as baz

Here, the uast:RuntimeImport node has the following structure:

  • a Path attribute with a single uast:Identifier node, with value foo
  • a Names attribute with a single uast:Alias node, with a Node attribute containing the uast:Identifier node with value bar, and a Name attribute containing a uast:Identifier node with value baz

Now, unfortunately this is not a Babelfish bug, as this structure for aliases is always the same: a uast:Identifier node is replaced with a uast:Alias node that has a Name and a Node attribute. This also makes sense, because in the following snippet the Names attribute would have 2 alias nodes instead of one:

from foo import bar as baz, bar2 as baz2

Anyway, this is a problem, because we currently lose this information: in the first snippet, only the foo identifier is kept. I am going to check what we can do and will push a PR when a proper solution is found. I think we should start dealing with import nodes in a specific way and simply go deeper on the Names attribute in all other cases, but I need to look into this more first.
