bblfsh / python-driver

License: GNU General Public License v3.0

Python 90.38% Shell 0.44% Go 8.75% Dockerfile 0.43%

babelfish driver python

python-driver's People

pombredanne mcarmonaa chubbymaggie abeaumont afcarl dennwc tsolakoua bzz juanjux taper21 lwsanty dhanush breitfelder stg-tud arielferdman pbehnke

python-driver's Issues

PRIMITIVE role meaning for Python

The documentation says:

    // Primitive is a language builtin.
    Primitive

Python has a lot of built-in names, but since any built-in function can be overridden, it is not possible to say whether a given name is built-in or not.

Can you please clarify what the Primitive role means for Python?
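A minimal sketch of why this is hard to decide from the source alone: the standard builtins module lists the default built-in names, but any of them can be rebound in user code. The helper names is_default_builtin and shadowing_demo are hypothetical and not part of bblfsh.

```python
import builtins


def is_default_builtin(name):
    """Return True if `name` is one of Python's default built-in names."""
    return hasattr(builtins, name)


def shadowing_demo():
    """Rebind `len` locally and compare it with the real built-in."""
    len = lambda obj: 42  # local rebinding hides the built-in inside this scope
    return len([1, 2, 3]), builtins.len([1, 2, 3])
```

So a name-based check can only say that a token looks like a built-in, not that it still refers to one at that point in the program.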

There is no simple identifier of module when I import submodule

I run this code:

from bblfsh.client import BblfshClient
bc = BblfshClient("0.0.0.0:9432")
res = bc.parse("example.py", language="Python")
print(res)

where example.py is

import matplotlib.pyplot

Here is the full output:

uast {
  internal_type: "Module"
  children {
    internal_type: "Import"
    properties {
      key: "internalRole"
      value: "body"
    }
    children {
      internal_type: "Import.names"
      properties {
        key: "promotedPropertyList"
        value: "true"
      }
      children {
        internal_type: "alias"
        properties {
          key: "asname"
          value: "<nil>"
        }
        token: "matplotlib.pyplot"
        start_position {
          line: 1
          col: 1
        }
        roles: IMPORT_PATH
        roles: SIMPLE_IDENTIFIER
      }
      roles: 142
    }
    start_position {
      line: 1
      col: 1
    }
    roles: IMPORT_DECLARATION
    roles: STATEMENT
  }
  start_position {
    line: 1
    col: 1
  }
  roles: FILE
}

Why don't I have matplotlib as a simple identifier?
Is it correct?
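For comparison, Python's own ast module also keeps the dotted path as a single string on one alias node, which may be why the driver emits a single token here. This sketch only shows the native AST; it makes no claim about how the driver assigns roles.

```python
import ast

tree = ast.parse("import matplotlib.pyplot")
alias = tree.body[0].names[0]    # ast.alias node of the Import statement
dotted_name = alias.name         # the whole dotted path is one string
parts = dotted_name.split(".")   # splitting it is left to the consumer
```

A consumer that wants matplotlib on its own would have to split the token, as in the last line.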

Duplication of IDENTIFIER role

I found it in the output of this example https://gist.github.com/zurk/8f7dd974347925ae62c31d9441491613 for issue #94.

For the self token we have role duplication:

<self>: ['IDENTIFIER', 'QUALIFIED', 'IDENTIFIER', 'EXPRESSION']

or for class1:
<class1>: ['IDENTIFIER', 'QUALIFIED', 'IDENTIFIER', 'EXPRESSION']
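
A small sketch for spotting this kind of duplication in a role list such as the ones above. The function names are hypothetical, and roles are assumed to be plain strings as in the printed output.

```python
from collections import Counter


def duplicated_roles(roles):
    """Return the roles that occur more than once, sorted by name."""
    return sorted(role for role, count in Counter(roles).items() if count > 1)


def unique_roles(roles):
    """Drop repeated roles while keeping the original order."""
    return list(dict.fromkeys(roles))
```

For example, duplicated_roles(['IDENTIFIER', 'QUALIFIED', 'IDENTIFIER', 'EXPRESSION']) reports only IDENTIFIER.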

Wrong Role assignment for ""

If you extract the UAST for

""

you get the following tree:

#  Token  Internal Role  Roles Tree                 
                                                    
   ||     Module         FILE                       
1  ||     Expr           ┣ EXPRESSION               
1  ||     Str            ┗ ┗ BYTE, TUPLE, EXPRESSION

What I expect to see is:

#  Token  Internal Role  Roles Tree                 
                                                    
   ||     Module         FILE                       
1  ||     Expr           ┣ EXPRESSION               
1  ||     Str            ┗ ┗ ~BYTE~, ~TUPLE~, EXPRESSION, +STRING+, +VALUE+

Legend:

+ROLE+ -- add Role
~ROLE~ -- remove Role
?ROLE? -- maybe add/remove Role

Gist to generate UAST Roles visualization: https://gist.github.com/zurk/d314d67d9aac8843d3776c82cd738b40

Wrong Role assignment for {1:2}

If you extract the UAST for

{1:2}

you get the following tree:

#  Token  Internal Role  Roles Tree                                     
                                                                        
   ||     Module         FILE                                           
1  ||     Expr           ┣ EXPRESSION                                   
1  ||     Dict           ┃ ┣ BYTE, NULL, EXPRESSION                     
1  |1|    Num            ┃ ┃ ┣ BYTE, REGEXP, EXPRESSION, NULL, PRIMITIVE
1  |2|    Num            ┗ ┗ ┗ BYTE, REGEXP, EXPRESSION, NULL, VALUE

What I expect to see is:

#  Token  Internal Role  Roles Tree                                     
                                                                        
   ||     Module         FILE                                           
1  ||     Expr           ┣ EXPRESSION                                   
1  ||     Dict           ┃ ┣ ~BYTE~, ~NULL~, EXPRESSION, ?TYPE?                     
1  |1|    Num            ┃ ┃ ┣ ~BYTE~, ~REGEXP~, EXPRESSION, ~NULL~, ~PRIMITIVE~, +NUMBER+, +VALUE+
1  |2|    Num            ┗ ┗ ┗ ~BYTE~, ~REGEXP~, EXPRESSION, ~NULL~, +NUMBER+, +VALUE+

Legend:

+ROLE+ -- add Role
~ROLE~ -- remove Role
?ROLE? -- maybe add/remove Role

Gist to generate UAST Roles visualization: https://gist.github.com/zurk/d314d67d9aac8843d3776c82cd738b40

Parsing failed due to Exception: Could not determine Python version

Using the latest https://gist.github.com/bzz/c0c3dbcab5fecbe48e22167e2ad78595, UAST parsing fails on what seems to be https://github.com/damoeb/kalipo/blob/master/kalipo-ir/harvester/spiders/heise_spider.py

Serve log

time="2017-06-21T13:44:13Z" level=debug msg="sending ParseUAST request: Filename:"kalipo-ir/harvester/spiders/heise_spider.py" Language:"python" Content:"import scrapy\nfrom scrapy.contrib.spiders import CrawlSpider, Rule\nfrom scrapy.contrib.linkextractors import LinkExtractor\nfrom scrapy.selector import Selector\n\nfrom harvester.items import Comment\nimport time\nimport calendar\nimport re\n\nclass HeiseSpider(CrawlSpider):\n    name = \"heise\"\n    allowed_domains = [\"www.heise.de\"]\n    start_urls = [\n            \"http://www.heise.de/forum/Telepolis/Kommentare/Ohne-Vorratsdatenspeicherung-sterben-vermisste-Kinder-und-Suizidale/forum-242979/\"\n    ]\n\n    rules = (\n        #Rule(LinkExtractor(allow=('/tp/foren/[^/]+/forum-[0-9]+/list'))),\n\tRule(LinkExtractor(allow=('/posting-[0-9]+/show')), callback='parse_item')\n    )\n\n    def clean_str(self, val):\n\treturn val.replace(u'\\xa0', u' ').strip()\n\n    def to_str(self, arr):\n\treturn self.clean_str(''.join(arr))\n\n    def parse_date(self, val):\n\tgrps = re.search('[0-9]+\\. ([A-Za-z]+) [0-9]{4} [0-9]{2}:[0-9]{2}', val)\n\n\tmnth = grps.group(1)\n\n   \tmonths = ['Januar', 'Februar', 'M\\u00e4rz', 'April', 'Mai', 'Juni', 'Juli', 'August', 'September', 'Oktober', 'November', 'Dezember']\n\tfor index, item in enumerate(months):\n\t   if item.lower() == mnth.lower():\n\t      val = val.replace(mnth, str(index))\n\t      break\n\n\treturn calendar.timegm(time.strptime(val, \"%d. 
%m %Y %H:%M\"))\n\n    def parse_item(self, response):\n        sel = Selector(response)\n\n\n\tisRoot = len(response.xpath(\"//ul[@class='forum_navi'][2]/li\")) == 6\n\n\tif !isRoot:\n\t   # find parent\n\t   parent = response.xpath(\"//span[@class='active_post']/../../../parent::ul[@class='nextlevel_line']/preceding-sibling::div[@class='hover_line']\")\n\t   # get link\n\t   link = parent.xpath(\".//div[@class='thread_title']/a\")\n\t   # extract parent id from href\n\n\n\titem = Comment()\n\titem['text'] = self.to_str(sel.xpath(\"//h3[@class='posting_subject']/text()\").extract()) + self.to_str(sel.xpath(\"//p[@class='posting_text']/text()\").extract())\n\titem['url'] = response.url\n\titem['parent'] = 'unknown'\n\titem['level'] = 0\n\titem['thread'] = re.search('forum-([0-9]+)', response.url).group(1)\n\titem['author'] = self.to_str(sel.xpath(\"//div[@class='user_info']/i//text()\").extract())\n\titem['date'] = self.parse_date(self.to_str(response.xpath(\"//div[@class='posting_date']/text()\").extract()))\n        return item\n\n" "
time="2017-06-21T13:35:14Z" level=error msg="driver bblfsh/python-driver:latest (01BK5BZ6N1S7MZBCSFPADDBFSW) stderr: ERROR:root:Filepath: , Errors: ['Traceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/python_driver/requestprocessor.py", line 151, in process_request\n    raise Exception(\'Could not determine Python version\')\nException: Could not determine Python version\n']"

Client logs

Read kalipo-ir/harvester/spiders/heise_spider.py, 2247 bytes	Parsing file:'kalipo-ir/harvester/spiders/heise_spider.py'

Panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1186147]

goroutine 1 [running]:
github.com/bblfsh/sdk/uast.(*Node).ProtoSize(0x0, 0xc4201f9f50)
	/go/src/github.com/bblfsh/sdk/uast/generated.pb.go:510 +0x37
github.com/bblfsh/sdk/uast.(*Node).Marshal(0x0, 0x1c18070, 0xc4200102c0, 0xc4201f9f20, 0x0, 0x0)
	/go/src/github.com/bblfsh/sdk/uast/generated.pb.go:352 +0x2f
main.main.func1(0xc4201c4e60, 0xc4201c4e60, 0x0)
	/go/src/github.com/src-d/analysis-pipeline/juanjo/pyFromGit2ast2pb.go:69 +0x4d5

Node positions do not conform to the spec

Neither the start_position.offset field nor end_position is returned, although both are documented in https://doc.bblf.sh/uast/specification.html
All tokens on the same line have the same start_position.

[...] it is guaranteed that nodes in a UAST either have no position attached or they have a position with valid offset, line and col. [...]
End position, if present in a token node, is the position of the last character of the token in the original source code.

example:

import sys
sys.stdout.write("Hello world!\n")

Client: client-python
The same position is returned for the sys and stdout nodes on the second line:

start_position {
    line: 2
    col: 1
}
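
One possible explanation, assuming the driver takes its positions from the native Python AST: in the ast module an Attribute node starts where the whole dotted expression starts, so sys, sys.stdout and sys.stdout.write all report the same line and column. The sketch below only inspects the native AST.

```python
import ast

source = 'import sys\nsys.stdout.write("Hello world!\\n")\n'
call = ast.parse(source).body[1].value   # the Call node on line 2
write_attr = call.func                   # Attribute: sys.stdout.write
stdout_attr = write_attr.value           # Attribute: sys.stdout
sys_name = stdout_attr.value             # Name: sys

# (lineno is 1-based, col_offset is 0-based in the ast module)
positions = [(n.lineno, n.col_offset) for n in (sys_name, stdout_attr, write_attr)]
```

All three entries are identical, so the position of the stdout token itself has to be recovered some other way, for example from the tokenizer.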

A child process hangs if I import bblfsh

If you run this code

import os
import sys
import time

import bblfsh

def func(n):
    print('Sleep during {} sec...'.format(n))
    time.sleep(n)
    return n

if __name__ == '__main__':
    for n in range(1, 3):
        pid = os.fork()
        if pid == 0:
            # if you import bblfsh here instead, everything will be fine
            res = func(n)
            print('Sleep in is Done!')
            sys.exit(0)
        else:
            print('os.waitpid is waiting for {}...'.format(pid))
            _, status = os.waitpid(pid, 0)
            print('os.waitpid is fine!')
    print('ok, terminate.')

You will see that the first child process hangs during sys.exit(0).
If you comment out import bblfsh, everything is fine.
Also, you can replace import bblfsh with

from bblfsh.github.com.bblfsh.sdk.protocol.generated_pb2 import ParseResponse

and get the same effect. This is the main point: the problem is not exactly in bblfsh, but in grpc.

It seems that ParseResponse should be imported in the child process, because grpc starts threads during import, and if you call fork after that, the child process hangs on exit.

In grpc/grpc#7951 (comment) you can read that fork is not supported.

I am writing this here mostly as an FYI; I think it is important to know.

It may not be a big deal to move the ParseRequest and ProtocolServiceStub imports in /bblfsh/client.py to the places where they are actually used. That would help avoid this kind of problem. Or we could find a more elegant way. What do you think?

P.S.: tested on macOS and Ubuntu.

What is the difference between the IDENTIFIER and NAME roles for Python?

The documentation says:

// Identifier is any form of identifier, used for variable names, functions, packages, etc.
    Identifier
...
// Name is an identifier used to reference a value.
    Name

But in Python all identifiers are references.
So, what is the difference between these two roles in Python?
If there is no difference, we should add the NAME role to each IDENTIFIER node.

QUALIFIED_IDENTIFIER is not SIMPLE_IDENTIFIER and duplication of CALL_CALLEE role

I found that QUALIFIED_IDENTIFIER is not SIMPLE_IDENTIFIER, but @vmarkovtsev says that it is supposed to be.
Also, I found duplication of CALL_CALLEE role.

How to reproduce:

from bblfsh.client import BblfshClient

filepath = "./matplotlib_example.py"
bc = BblfshClient("0.0.0.0:9432")
res = bc.parse(filepath, language='Python')
print(res)

matplotlib_example.py:

from matplotlib import pyplot as plt
plt.figure()

Output (lines 76-83):

        token: "figure"
        start_position {
          line: 2
          col: 1
        }
        roles: CALL_CALLEE
        roles: CALL_CALLEE
        roles: QUALIFIED_IDENTIFIER

The problem is with the figure token: it is not a SIMPLE_IDENTIFIER, which means that function names are not taken into account in our machine learning analysis.

P.S.: Moved from bblfsh/bblfshd#82 because it is a python specific problem.

SIMPLE_IDENTIFIER is not assigned to import symbols and aliases

Hi,
I was playing with UASTs and found what I think is a bug.
Minimal reproducible example:

from os import path
import sys

import numpy as np

All SIMPLE_IDENTIFIERS: {'path': 1, 'numpy': 1, 'sys': 1}. As we can see, np and os are missing.

Some helper code to debug:

from collections import Counter

from ast2vec.bblfsh_roles import SIMPLE_IDENTIFIER
from ast2vec.repo2.base import Repo2Base


class Repo2IdModel:
    NAME = "Repo2IdModel"


class Repo2IdCounter(Repo2Base):
    """
    Print all SIMPLE_IDENTIFIERs (and counters) from repository
    """
    MODEL_CLASS = Repo2IdModel

    def collect_id_cnt(self, root, id_cnt):
        for ch in root.children:
            if SIMPLE_IDENTIFIER in ch.roles:
                id_cnt[ch.token] += 1
            self.collect_id_cnt(ch, id_cnt)

    def convert_uasts(self, file_uast_generator):
        for file_uast in file_uast_generator:
            print("-" * 20 + " " + str(file_uast.filepath))
            id_cnt = Counter()
            self.collect_id_cnt(file_uast.response.uast, id_cnt)
            print(id_cnt)


if __name__ == "__main__":
    repo = "test/imports/"
    c2v = Repo2IdCounter(linguist="path/to/enry", bblfsh_endpoint="0.0.0.0:9432")
    c2v.convert_repository(repo)
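
For reference, a sketch of where these names live in Python's native AST: the module of a from-import and the asname of an alias are plain string attributes rather than child nodes, which might be why they end up without the role. The function import_identifiers is a hypothetical helper that depends only on the standard library, not on ast2vec or bblfsh.

```python
import ast

source = "from os import path\nimport sys\n\nimport numpy as np\n"


def import_identifiers(code):
    """Collect every name introduced or referenced by import statements."""
    names = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.ImportFrom) and node.module:
            names.append(node.module)          # 'os' is a string attribute
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                names.append(alias.name)       # 'path', 'sys', 'numpy'
                if alias.asname:
                    names.append(alias.asname)  # 'np' is a string attribute too
    return names
```

Run on the example above, the result contains os and np in addition to the three names the driver reported.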

Try to automatically fix files with mixed tabs

This is proving to be a common error causing the Python AST module not to parse the source file, but it should be easily fixable using the reindent.py script/module included with the Python standard distribution.
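
A minimal sketch of the detect-and-fix idea. Note that this is not reindent.py: as a simpler stand-in it expands tabs in leading whitespace with str.expandtabs, assuming a tab size of 8. The function names are hypothetical, and leading tabs inside multi-line string literals would be rewritten as well.

```python
def has_tab_error(source):
    """Return True if compiling `source` fails with a TabError."""
    try:
        compile(source, "<source>", "exec")
    except TabError:
        return True
    except SyntaxError:
        return False
    return False


def expand_leading_tabs(source, tabsize=8):
    """Replace tabs in each line's indentation with spaces."""
    fixed = []
    for line in source.splitlines(keepends=True):
        body = line.lstrip(" \t")
        indent = line[: len(line) - len(body)]
        fixed.append(indent.expandtabs(tabsize) + body)
    return "".join(fixed)


# One block indented with 8 spaces on one line and a tab on the next.
mixed = "if True:\n        x = 1\n\ty = 2\n"
```

The driver could retry the parse on expand_leading_tabs(source) only when has_tab_error(source) is true, so well-formed files are never touched.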

IMPORT node problems: duplication, roles, positions

Hi,
I tried the new version of python-driver and found several errors.
Code in test.py:

from collections import defaultdict

Then launch the bblfsh client:

egor@egor-sourced:~/workspace/uast_playground$ python3 -m bblfsh -f test.py 
uast {
  internal_type: "Module"
  children {
    internal_type: "ImportFrom"
    properties {
      key: "internalRole"
      value: "body"
    }
    properties {
      key: "level"
      value: "0"
    }
    children {
      internal_type: "alias"
      properties {
        key: "asname"
        value: "<nil>"
      }
      properties {
        key: "internalRole"
        value: "names"
      }
      token: "defaultdict"
      start_position {
        offset: 24
        line: 1
        col: 25
      }
      end_position {
        offset: 34
        line: 1
        col: 35
      }
      roles: IMPORT_PATH
      roles: SIMPLE_IDENTIFIER
    }
    children {
      internal_type: "ImportFrom.module"
      properties {
        key: "promotedPropertyString"
        value: "true"
      }
      token: "collections"
      roles: IMPORT_PATH
      roles: SIMPLE_IDENTIFIER
    }
    token: "collections"
    start_position {
      offset: 5
      line: 1
      col: 6
    }
    end_position {
      offset: 34
      line: 1
      col: 35
    }
    roles: IMPORT_DECLARATION
    roles: STATEMENT
  }
  start_position {
    line: 1
    col: 1
  }
  end_position {
    offset: 34
    line: 1
    col: 35
  }
  roles: FILE
}

So the token collections appears twice (I think that is wrong).

There is no start or end position here:

internal_type: "ImportFrom.module"
properties {
  key: "promotedPropertyString"
  value: "true"
}
token: "collections"
roles: IMPORT_PATH
roles: SIMPLE_IDENTIFIER

There is no SIMPLE_IDENTIFIER role here:

internal_type: "ImportFrom"
properties {
  key: "internalRole"
  value: "body"
}
properties {
  key: "level"
  value: "0"
}
children {
  internal_type: "alias"
  properties {
    key: "asname"
    value: "<nil>"
  }
  properties {
    key: "internalRole"
    value: "names"
  }
  token: "defaultdict"
  start_position {
    offset: 24
    line: 1
    col: 25
  }
  end_position {
    offset: 34
    line: 1
    col: 35
  }
  roles: IMPORT_PATH
  roles: SIMPLE_IDENTIFIER
}
children {
  internal_type: "ImportFrom.module"
  properties {
    key: "promotedPropertyString"
    value: "true"
  }
  token: "collections"
  roles: IMPORT_PATH
  roles: SIMPLE_IDENTIFIER
}
token: "collections"
start_position {
  offset: 5
  line: 1
  col: 6
}
end_position {
  offset: 34
  line: 1
  col: 35
}
roles: IMPORT_DECLARATION
roles: STATEMENT

Wrong Role assignment for t = set(); t = {}

If you extract the UAST for

t = set()
t = {0,1}

you get the following tree:

#  Token  Internal Role  Roles Tree                                      
                                                                         
   ||     Module         FILE                                            
1  ||     Assign         ┣ BINARY, THIS, EXPRESSION                      
1  |t|    Name           ┃ ┣ LEFT, IDENTIFIER, EXPRESSION                
1  ||     Call           ┃ ┣ FUNCTION, CALLEE, EXPRESSION, RIGHT         
1  |set|  Name           ┃ ┗ ┗ CALLEE, POSITIONAL, IDENTIFIER, EXPRESSION
2  ||     Assign         ┣ BINARY, THIS, EXPRESSION                      
2  |t|    Name           ┃ ┣ LEFT, IDENTIFIER, EXPRESSION                
2  ||     Set            ┃ ┣ BYTE, STRING, EXPRESSION, RIGHT             
2  |0|    Num            ┃ ┃ ┣ BYTE, REGEXP, EXPRESSION                  
2  |1|    Num            ┗ ┗ ┗ BYTE, REGEXP, EXPRESSION

What I expect to see is:

#  Token  Internal Role  Roles Tree                                      
                                                                         
   ||     Module         FILE                                            
1  ||     Assign         ┣ BINARY, ~THIS~, EXPRESSION, +Assignment+                      
1  |t|    Name           ┃ ┣ LEFT, IDENTIFIER, EXPRESSION                
1  ||     Call           ┃ ┣ FUNCTION, ~CALLEE~, EXPRESSION, RIGHT, +CALL+         
1  |set|  Name           ┃ ┗ ┗ CALLEE, ~POSITIONAL~, IDENTIFIER, EXPRESSION, +Name+
2  ||     Assign         ┣ BINARY, ~THIS~, EXPRESSION, +Assignment+                      
2  |t|    Name           ┃ ┣ LEFT, IDENTIFIER, EXPRESSION                
2  ||     Set            ┃ ┣ ~BYTE~, ~STRING~, EXPRESSION, RIGHT, +SET+, ?TYPE?              
2  |0|    Num            ┃ ┃ ┣ ~BYTE~, ~REGEXP~, EXPRESSION, +NUMBER+, +VALUE+                  
2  |1|    Num            ┗ ┗ ┗ ~BYTE~, ~REGEXP~, EXPRESSION, +NUMBER+, +VALUE+

Legend:

+ROLE+ -- add Role
~ROLE~ -- remove Role
?ROLE? -- maybe add/remove Role

Gist to generate UAST Roles visualization: https://gist.github.com/zurk/d314d67d9aac8843d3776c82cd738b40

Exception deserializing message in BblfshClient.parse()

I tried to run the bblfsh Python client and it fails (change the endpoint if needed):

from bblfsh import BblfshClient
BblfshClient("172.17.0.1:9432").parse('./TickType.py', language='Python', )

Here is file example: TickType.py.zip

The output I get:

ERROR:root:Exception deserializing message!
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/grpc/_common.py", line 129, in _transform
    return transformer(message)
google.protobuf.message.DecodeError: Error parsing message
Traceback (most recent call last):
  File "bug_exapmle.py", line 3, in <module>
    BblfshClient("172.17.0.1:9432").parse('./temp/TickType.py', language='Python', )
  File "/usr/local/lib/python3.5/dist-packages/bblfsh/client.py", line 58, in parse
    response = self._stub.Parse(request, timeout=timeout)
  File "/usr/local/lib/python3.5/dist-packages/grpc/_channel.py", line 507, in __call__
    return _end_unary_response_blocking(state, call, False, deadline)
  File "/usr/local/lib/python3.5/dist-packages/grpc/_channel.py", line 455, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.INTERNAL, Exception deserializing response!)>

The bblfsh server does not fail on this example if I run it directly.
It seems to be some problem in the grpc module...

status: ERROR errors: "column out of bounds: 0 [1, 1]"

Hi,
I tried to extract a UAST from a Python file and got an unexpected result: the output starts with an error message.

The content of exp.py:

# Here you can find the code of /app/exp.py:
# find x in range (-2, 2) that should minimize math.sin function
import math
from hyperopt import fmin, tpe, hp
from hyperopt.mongoexp import MongoTrials

mongodb = "mongodb"
db_port = "27017"
db_name = "ml_exp"
exp_key = "exp1"

trials = MongoTrials("mongo://%s:%s/%s/jobs" % (mongodb, db_port, db_name), exp_key="exp1")
best = fmin(math.sin, hp.uniform("x", -2, 2), trials=trials, algo=tpe.suggest, max_evals=10)
print("Result:", best)

and result of extraction:

status: ERROR
errors: "column out of bounds: 0 [1, 1]"
uast {
  internal_type: "Module"
...

Full UAST is attached:
exp.uast.txt

PS:
Is it correct that a node has several roles lines?

      roles: STRING_LITERAL
      roles: EXPRESSION
      roles: ASSIGNMENT_VALUE

Wrong Role assignment for f()

If you extract the UAST for

f()

you get the following tree:

#  Token  Internal Role  Roles Tree                                      
                                                                         
   ||     Module         FILE                                            
1  ||     Expr           ┣ EXPRESSION                                    
1  ||     Call           ┃ ┣ FUNCTION, CALLEE, EXPRESSION                
1  |f|    Name           ┗ ┗ ┗ CALLEE, POSITIONAL, IDENTIFIER, EXPRESSION

What I expect to see is:

#  Token  Internal Role  Roles Tree                                      
                                                                         
   ||     Module         FILE                                            
1  ||     Expr           ┣ EXPRESSION                                    
1  ||     Call           ┃ ┣ FUNCTION, ~CALLEE~, EXPRESSION, +CALL+                
1  |f|    Name           ┗ ┗ ┗ CALLEE, ~POSITIONAL~, IDENTIFIER, EXPRESSION,+Name+

Legend:

+ROLE+ -- add Role
~ROLE~ -- remove Role
?ROLE? -- maybe add/remove Role

Gist to generate UAST Roles visualization: https://gist.github.com/zurk/d314d67d9aac8843d3776c82cd738b40

Wrong Role assignment for (0,)

If you extract the UAST for

(0,)

you get the following tree:

#  Token  Internal Role  Roles Tree                    
                                                       
   ||     Module         FILE                          
1  ||     Expr           ┣ EXPRESSION                  
1  ||     Tuple          ┃ ┣ BYTE, TYPE, EXPRESSION    
1  |0|    Num            ┗ ┗ ┗ BYTE, REGEXP, EXPRESSION

What I expect to see is:

#  Token  Internal Role  Roles Tree                    
                                                       
   ||     Module         FILE                          
1  ||     Expr           ┣ EXPRESSION                  
1  ||     Tuple          ┃ ┣ ~BYTE~, ?TYPE?, EXPRESSION, +TUPLE+    
1  |0|    Num            ┗ ┗ ┗ ~BYTE~, ~REGEXP~, EXPRESSION, +NUMBER+, +VALUE+

Legend:

+ROLE+ -- add Role
~ROLE~ -- remove Role
?ROLE? -- maybe add/remove Role

Gist to generate UAST Roles visualization: https://gist.github.com/zurk/d314d67d9aac8843d3776c82cd738b40

There are still nodes with Column set to 0

All whitespace nodes.
Function arguments.
All operators.

Some nodes have the Column location at 0

Some nodes (probably the ones originating from a property in the Python native AST, without location info) have the Column at 0, which is wrong. They should take the Col from their parent node in the case of properties.
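
A sketch of the proposed fix on a simplified tree. Nodes are modelled as plain dicts with "col" and "children" keys, which is an assumption made for illustration and not the driver's real data structure; inherit_columns is a hypothetical name.

```python
def inherit_columns(node, parent_col=1):
    """Give every node whose column is 0 the column of its nearest parent."""
    col = node.get("col", 0)
    if col == 0:
        node["col"] = col = parent_col
    for child in node.get("children", []):
        inherit_columns(child, col)
    return node
```

Because the walk is top-down, a chain of property nodes without location info all end up with the column of the closest ancestor that has one.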

This blocks the fixing of the offset with the TransformationUASTParser.

Decorators have the same line number as the functions they decorate

Hi,
I tested https://github.com/EgorBu/uast_playground on a new file:

python3 -m uast_playground repo2id_str -r test_data/exp2

Problems found:

wrong start and end positions of tokens (a token spans more than one line, tokens overlap, wrong length, etc.)
decorators and function definitions have the same start line
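
The second point may come from the native AST itself: before Python 3.8 the ast module reported the line of the first decorator as the line of the function definition, while 3.8 and later report the def line. The sketch below shows how to read both lines; which value def_line takes depends on the interpreter version.

```python
import ast

source = "@decorator\ndef func():\n    pass\n"
func = ast.parse(source).body[0]
decorator_line = func.decorator_list[0].lineno  # line of the decorator expression
def_line = func.lineno  # the def line on Python 3.8+, the decorator line before
```

If the driver runs the parse under an older interpreter, the two values would coincide, which matches the behaviour described above.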

Identifiers with empty tokens.

I found identifiers with empty tokens.
https://gist.github.com/zurk/8f7dd974347925ae62c31d9441491613

Here you can find my example.
Run run_me.py and you get the output for example.py.
I think identifiers should never have empty tokens.
Or am I wrong?

Strange <nil> token. Should be None?

I found it in my visualization example:
https://gist.github.com/zurk/d314d67d9aac8843d3776c82cd738b40

8  ||             ┣ BINARY, THIS, EXPRESSION                                     
8  |var1|         ┃ ┣ LEFT, IDENTIFIER, EXPRESSION                               
7  ||             ┃ ┃ ┣ INCOMPLETE                                               
7  |\n|           ┃ ┃ ┗ ┗ DOCUMENTATION                                          
8  |<nil>|        ┃ ┗ BYTE, NUMBER, EXPRESSION, RIGHT

If you take a look at the code, you will see that it actually corresponds to the None token.
Also, I doubt that None should have a NUMBER role; maybe NULL?

Missing QUALIFIED role

If you get the UAST for

import lib1
lib1.lib2.lib3.var = None

you get the following list of identifiers with roles:

<lib1>:         ['IMPORT', 'PATHNAME', 'IDENTIFIER']
<var>:          ['IDENTIFIER', 'EXPRESSION', 'LEFT']
<lib3>:         ['IDENTIFIER', 'EXPRESSION']
<lib2>:         ['IDENTIFIER', 'EXPRESSION']
<lib1>:         ['IDENTIFIER', 'QUALIFIED', 'IDENTIFIER', 'EXPRESSION']

The QUALIFIED role is missing on the lib2 and lib3 identifiers.
You can use the code from issue #94 to reproduce the result.
This gist: https://gist.github.com/zurk/8f7dd974347925ae62c31d9441491613
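
To make the expectation concrete, here is a sketch that flattens the dotted target with Python's ast module. Every element of the resulting chain belongs to the same qualified name, so one would expect them to be tagged consistently. attribute_chain is a hypothetical helper, not driver code.

```python
import ast


def attribute_chain(node):
    """Return the parts of a dotted name such as a.b.c, left to right."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)   # rightmost part first
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)     # the leftmost plain name
    return list(reversed(parts))


target = ast.parse("lib1.lib2.lib3.var = None").body[0].targets[0]
chain = attribute_chain(target)
```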

Wrong position of SIMPLE_IDENTIFIER when using 'global'

Hi,
The position of SIMPLE_IDENTIFIER is wrong when a global statement is used.
Example:

abc = None


def b():
    global abc
    pass

Vis:

SIMPLE_IDENTIFIER is not assigned to argument names

Hi!
SIMPLE_IDENTIFIER is not assigned where it should be.
A better example:

def some_funct(some_arg):
  pass

some_funct(some_arg=some_val)

some_funct(some_arg=some_arg)

Vis:

PS: there is also some magic when you pass a variable with the same name as the argument.

Strange role assignment in version v0.9.0

It is not published on GitHub, but it is the latest tag here: https://hub.docker.com/r/bblfsh/python-driver/tags/

Here is an example from the dashboard:

As you can see, bblfsh returns completely strange roles.
If I run version v0.8.2 using BBLFSH_DRIVER_IMAGES="python=docker://bblfsh/python-driver:v0.8.2" everything is fine.

Maybe it is an erroneous release...

Some Python parsing errors are not propagated to the driver response

For example, this file: https://github.com/damoeb/kalipo/blob/master/kalipo-ir/harvester/spiders/heise_spider.py produces a TabError/SyntaxError (Python 3/2) but the only error that the driver returns is "Could not determine Python version", which is true, but also incomplete; the original SyntaxError should be added to the driver response.
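
A sketch of how the original error could be captured so that it can be appended to the response. It uses only the built-in compile; the function name syntax_errors and the message format are assumptions for illustration, not the driver's actual API.

```python
def syntax_errors(source, filename="<source>"):
    """Return a list with the parser's own error message, if there is one."""
    errors = []
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:  # TabError and IndentationError are subclasses
        errors.append(
            "{}: {} (line {})".format(type(exc).__name__, exc.msg, exc.lineno)
        )
    return errors
```

The returned strings could then be added next to "Could not determine Python version", one per interpreter version that was tried.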

IfCondition Role not assigned in some cases

When an If does not have the usual form, such as a binary expression, the IfCondition role does not appear in the UAST. For an input like this:

if True:
    print(True)

if functionCallThatReturnsABoolean():
    print(True)

the UAST roles that we expect are something like this:

If{
           IfCondition,
           IfBody
}

but we get this:

If{
           IfBody
           expression/functionCall
}

Use glide to lock the SDK version

Positions with code embedded in f-strings are not always correct

Code inside Python's f-strings has its position starting at line 1, column 1, regardless of the real position. Some of these (the line numbers) are fixed by the synchronized tokenizer, but the columns are not fixed in all cases.

range(10) wrong role assignment

If you extract the UAST for range(10) you get the following tree:

line#   token       Roles

        ||          FILE                                                                                                     
1       ||          ┣ EXPRESSION                                                                                             
1       ||          ┃ ┣ FUNCTION, CALLEE, EXPRESSION                                                                         
1       |10|        ┃ ┃ ┣ EXPRESSION, FUNCTION, DECLARATION, ARGUMENT, NAME, IDENTIFIER, CALLEE, ARGUMENT, NOOP
1       |range|     ┗ ┗ ┗ CALLEE, POSITIONAL, IDENTIFIER, EXPRESSION

What I expect to see is:

line#   token       Roles

        ||          FILE                                                                                                     
1       ||          ┣ EXPRESSION                                                                                             
1       ||          ┃ ┣ FUNCTION, CALLEE, EXPRESSION                                                                         
1       |10|        ┃ ┃ ┣ NUMBER, EXPRESSION, ARGUMENT, NAME, IDENTIFIER, ARGUMENT, POSITIONAL, VALUE
1       |range|     ┗ ┗ ┗ CALLEE, IDENTIFIER

I am not sure about the EXPRESSION role. It is too common.

Also, I ran another experiment and found that if you parse just 10 you get:

  ||   FILE                        
1 ||   ┣ EXPRESSION                
1 |10| ┗ ┗ BYTE, REGEXP, EXPRESSION

What I expect to see is:

  ||   FILE                        
1 ||   ┣ EXPRESSION                
1 |10| ┗ ┗ NUMBER, EXPRESSION, VALUE

BTW, what about the MODULE role? Each file in Python is considered a module, or am I wrong?

Add endposition to nodes

From:

#30

Node end positions are not mandatory in the current spec if the native driver doesn't provide them, as happens with the Python driver, but it would be nice to have them in this driver.

CC @dpordomingo

Evaluate jedi as an alternative parser to handle positioning and python detection

http://jedi.readthedocs.io/

Decorator for class breaks logic of computation of start position of class name

Hi,

ex:

@register_class
class a:
    pass

vis:

Bblfsh fails to extract UAST from file

I get a strange error when trying to get the UAST from this file: oo.py.zip

I run this code

from bblfsh.client import BblfshClient
bc = BblfshClient("0.0.0.0:9432")
res = bc.parse("./oo.py", language='Python')
print(res)

Output:

status: FATAL
errors: "expected object of type map[string]interface{}, got: \"NoneLiteral\""

but the .py file seems to be correct, because I can run python3 ./oo.py without any problem.

Also, the bblfsh server is running in Docker and outputs only

time="2017-08-08T20:34:55Z" level=info msg="parsing oo.py (34525 bytes)"

SIMPLE_IDENTIFIER is assigned wrongly in `with` statement

Hi,
ex:

with open(os.path.join(args.output, "row_vocab.txt"), "w") as out:
    out.write('\n'.join(chosen_words))

and the UAST contains a node with an empty token and a wrong position (0,0):

internal_type: "With.items"
properties {
  key: "promotedPropertyList"
  value: "true"
}
children {
  internal_type: "withitem"
  children {
    internal_type: "Name"
    properties {
      key: "ctx"
      value: "Load"
    }
    properties {
      key: "internalRole"
      value: "context_expr"
    }
    token: "a"
    start_position {
      offset: 5
      line: 1
      col: 6
    }
    end_position {
      offset: 5
      line: 1
      col: 6
    }
    roles: SIMPLE_IDENTIFIER
    roles: EXPRESSION
  }
  children {
    internal_type: "Name"
    properties {
      key: "ctx"
      value: "Store"
    }
    properties {
      key: "internalRole"
      value: "optional_vars"
    }
    token: "b"
    start_position {
      offset: 10
      line: 1
      col: 11
    }
    end_position {
      offset: 10
      line: 1
      col: 11
    }
    roles: SIMPLE_IDENTIFIER
    roles: EXPRESSION
  }
  start_position {
    line: 1
    col: 1
  }
  end_position {
    offset: 10
    line: 1
    col: 11
  }
  roles: SIMPLE_IDENTIFIER
  roles: INCOMPLETE
}
roles: SIMPLE_IDENTIFIER
roles: EXPRESSION
roles: INCOMPLETE

Nodes without any roles.

I wrote a small tool that collects statistics on the number of nodes per number of node roles in UASTs. It turned out that in my dataset there are some cases where no roles are assigned to a UAST node.

Repositories: /storage/timofei/repos
Extracted UASTs: /storage/timofei/uasts
Collected statistics: uasts_stat.txt
List of suspicious UASTs (csv file with columns: path to UAST, total number of nodes, number of nodes without roles): uasts_susp.txt
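The kind of statistic described above can be sketched as follows. This is not the tool from the issue: the node layout (a dict with `roles` and `children` keys) and the function name `count_by_role_number` are assumptions made only for illustration.

```python
from collections import Counter


def count_by_role_number(node, counter=None):
    """Count nodes grouped by how many roles each node carries."""
    if counter is None:
        counter = Counter()
    counter[len(node.get("roles", []))] += 1
    for child in node.get("children", []):
        count_by_role_number(child, counter)
    return counter


# Hypothetical UAST with one node that has no roles at all.
sample = {
    "roles": ["FILE"],
    "children": [
        {
            "roles": ["IMPORT_DECLARATION", "STATEMENT"],
            "children": [{"roles": [], "children": []}],
        }
    ],
}

stats = count_by_role_number(sample)
print(dict(stats))
```

A non-zero count under key `0` marks a UAST as suspicious in the sense used above.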

Parsing an empty file produces a fatal error

Sending an empty file to the Python driver produces a fatal error (which by definition stops the driver from processing more requests). This shouldn't happen, since empty files are common in Python (__init__.py); the driver should just produce an error with an empty UAST returned.
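As a point of reference, Python's own parser accepts empty source without complaint. The sketch below uses only the standard `ast` module and says nothing about where the driver fails; it just shows that an empty module is a well-defined result on the native side.

```python
import ast

# Parsing an empty string yields a Module node with an empty body.
tree = ast.parse("")
print(type(tree).__name__, tree.body)
```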

"column out of bounds" on minitwit

First merge bblfsh/python-client#38 so that errors are printed.

Then run:

git clone https://github.com/pallets/flask
python3 -m bblfsh -f flask/examples/minitwit/minitwit/minitwit.py >/dev/null

And you get an error from bblfsh:

column out of bounds: 63 [1, 51]

The file is parsed though.

SIMPLE_IDENTIFIER is not assigned when it has to be

Hi,
I tried to extract a UAST from Python code and noticed that when you define a function:
def a(b, c): ...
the node for this function has the roles 'FUNCTION_DECLARATION_BODY' and 'FUNCTION_DECLARATION_RECEIVER', but not 'SIMPLE_IDENTIFIER'.
In the documentation it's mentioned:

// SimpleIdentifier is the most basic form of identifier, used for variable
// names, functions, packages, etc.

I think that this node should have 'SIMPLE_IDENTIFIER' role.
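One possible reason, offered only as an assumption, is that the native Python AST stores the function name as a plain string attribute rather than as a separate identifier node. The sketch below shows this with the standard `ast` module.

```python
import ast

func = ast.parse("def a(b, c): ...").body[0]

# The name is a str attribute of FunctionDef, not a Name child node.
print(type(func).__name__, repr(func.name), type(func.name).__name__)
# The parameters are 'arg' nodes, again not Name nodes.
print([(type(p).__name__, p.arg) for p in func.args.args])
```

If the driver maps nodes one to one, there is no dedicated child for the name to attach an identifier role to, so the role would have to go on the function node itself.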

CALL Node for len has no CALL_CALLEE child

If you get the UAST for len(x), the CALL node has no CALL_CALLEE child, but the documentation (http://godoc.org/github.com/bblfsh/sdk/uast#Role) says:

// Call is any call, whether it is a function, procedure, method or macro.
// In its simplest form, a call will have a single child with a function
// name (CallCallee).

So, as I understand it, any CALL should have a CALL_CALLEE child.
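For reference, the native Python AST does expose the callee of `len(x)` as a distinct child of the call. The sketch below uses only the standard `ast` module; how the driver maps this field to roles is not shown in the issue.

```python
import ast

call = ast.parse("len(x)").body[0].value   # the Call node
callee = call.func                          # the node being called

print(type(call).__name__, type(callee).__name__, callee.id)
print([arg.id for arg in call.args])
```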

Divide the astexport.py module between pydetector and the driver

In the planning meeting we decided to split the functionality of the current pydetector.astexport.py module in two. pydetector keeps the retrieval of the unmodified native AST data structure (for the right Python version). python-driver gets the visitor, the noop extractor and the position updater, and reuses the data returned by pydetector to avoid parsing twice.

Incorrect positions for nodes: nodes have the same position (line continuation?)

Hi,
I ran some experiments and found a bug: SIMPLE_IDENTIFIER nodes have the same positions.

Reproducible example:

a += b.c["Some val"] \
    .d

And uast_playground gives us:

# New token 'b' at position (1, 6) has the same position as token 'd' at the same position. Skip new token.
a += b.c["Some val"] \
# Something wrong with token 'd' at pos (1, 6) - it's not equal to 'b' at this position in code
    .d

It looks like this happens because of the line continuation, because for this code:

a += b.c["Some val"].d

everything works well.

BTW: it looks like d is higher in the UAST than b. Is that correct? It appears earlier when traversing the UAST.
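Both observations can be compared against the native Python AST. The sketch below assumes Python 3.8 or later and uses only the standard `ast` module; it is not the driver's code. In the native tree the attribute access `.d` is the outermost expression node, and its reported start is the start of the whole expression, which is also where `b` starts.

```python
import ast

# The reproducer from this issue, with the backslash line continuation.
source = 'a += b.c["Some val"] \\\n    .d\n'
aug_assign = ast.parse(source).body[0]

outer = aug_assign.value            # Attribute node for ".d" (outermost)
name_b = outer.value.value.value    # Subscript -> Attribute ".c" -> Name "b"

# lineno is 1-based, col_offset is 0-based.
print(outer.attr, (outer.lineno, outer.col_offset))
print(name_b.id, (name_b.lineno, name_b.col_offset))
```

So `d` being above `b` matches the native nesting, and the token `d` would need a position computed separately from the node's start to avoid the collision.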


role assignment in a = b = c

If you have this code:

a = b = c 
var1 == var2 == var3
var4 < var5 < var6

and run UAST extraction, you get a strange role assignment.

Please take a look at the code. https://gist.github.com/zurk/66a3045746287bdb5002c0812b94f611
Here is output (the same gist):
https://gist.github.com/zurk/66a3045746287bdb5002c0812b94f611#file-output

Comments for output:

//*[@roleIdentifier] :
<a>:            ['LEFT', 'IDENTIFIER', 'EXPRESSION']
<b>:            ['LEFT', 'IDENTIFIER', 'EXPRESSION']
<c>:            ['RIGHT', 'IDENTIFIER', 'EXPRESSION']
<var2>:         ['IDENTIFIER', 'EXPRESSION']
<var3>:         ['IDENTIFIER', 'EXPRESSION']
<var1>:         ['IDENTIFIER', 'EXPRESSION', 'EXPRESSION', 'BINARY', 'LEFT']
<var5>:         ['IDENTIFIER', 'EXPRESSION']
<var6>:         ['IDENTIFIER', 'EXPRESSION']
<var4>:         ['IDENTIFIER', 'EXPRESSION', 'EXPRESSION', 'BINARY', 'LEFT']

I am not sure how it should be, but at least var3 and var6 are on the right side. :)
Why do we have 'EXPRESSION', 'BINARY' for var1 and var4? I think it should be just EXPRESSION, as for all the others. BINARY belongs to an upper level in the UAST tree.

Also, I am not sure that we can call the second and last expressions binary at all.
Maybe the first line of code can be considered as two binary expressions.

//*[@roleLeft] :
<a>:            ['LEFT', 'IDENTIFIER', 'EXPRESSION']
<b>:            ['LEFT', 'IDENTIFIER', 'EXPRESSION']
<var1>:         ['IDENTIFIER', 'EXPRESSION', 'EXPRESSION', 'BINARY', 'LEFT']
<var4>:         ['IDENTIFIER', 'EXPRESSION', 'EXPRESSION', 'BINARY', 'LEFT']

Ok, it can be true for 'a' and 'b'.

//*[@roleRight] :
<c>:            ['RIGHT', 'IDENTIFIER', 'EXPRESSION']
<>:             ['EXPRESSION', 'BINARY', 'RIGHT']
<>:             ['EXPRESSION', 'BINARY', 'RIGHT']

Are the tokens var4 and var6 missing?

//*[@roleBinary] :
<>:             ['BINARY', 'THIS', 'EXPRESSION']
<>:             ['EXPRESSION', 'BINARY']
<>:             ['EXPRESSION', 'BINARY', 'RIGHT']
<var1>:         ['IDENTIFIER', 'EXPRESSION', 'EXPRESSION', 'BINARY', 'LEFT']
<>:             ['EXPRESSION', 'BINARY', 'OPERATOR']
<==>:           ['BINARY', 'OPERATOR', 'EQUAL']
<==>:           ['BINARY', 'OPERATOR', 'EQUAL']
<>:             ['EXPRESSION', 'BINARY']
<>:             ['EXPRESSION', 'BINARY', 'RIGHT']
<var4>:         ['IDENTIFIER', 'EXPRESSION', 'EXPRESSION', 'BINARY', 'LEFT']
<>:             ['EXPRESSION', 'BINARY', 'OPERATOR']
<<>:            ['BINARY', 'OPERATOR', 'LESS_THAN']
<<>:            ['BINARY', 'OPERATOR', 'LESS_THAN']

Everything is fine for the a = b = c statement, except maybe the THIS role, but I am not sure. Please take a look at the definition and check whether it is suitable here.

And the last two lines of code are a mess.

<>: ['EXPRESSION', 'BINARY', 'OPERATOR']
It seems to be the node for the full ternary operator, because it has no token. I am not sure.

Hope it helps to investigate the problem.
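For comparison, this is how Python's standard `ast` module represents the three lines from this issue. It is only a sketch of the native tree, not of the driver's role mapping: the chained assignment is one `Assign` with two targets, and each chained comparison is one `Compare` node with a left operand, a list of operators and a list of comparators, rather than nested binary expressions.

```python
import ast

source = "a = b = c\nvar1 == var2 == var3\nvar4 < var5 < var6\n"
assign, eq_stmt, lt_stmt = ast.parse(source).body

# One Assign node: two targets, one value.
print([t.id for t in assign.targets], assign.value.id)

# One Compare node per chained comparison.
for stmt in (eq_stmt, lt_stmt):
    cmp_node = stmt.value
    print(
        type(cmp_node).__name__,
        cmp_node.left.id,
        [type(op).__name__ for op in cmp_node.ops],
        [c.id for c in cmp_node.comparators],
    )
```

Only the first operand is stored as `left`, which may explain why var1 and var4 are the only comparison operands that received a LEFT role.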

Server "hanging". Empty code received, returning empty UAST

Related to issue bblfsh/bblfshd#101 (the problem was actually not in the server but in python-driver). I have roughly the same symptoms with a new driver. At some point the server logs:

time="2017-09-22T15:41:39Z" level=debug msg="Empty code received, returning empty UAST"

And then nothing, although my program keeps sending queries. Some time later (after about 30 seconds) the server logs

time="2017-09-22T15:42:11Z" level=debug msg="driver exited without error"

and then parsing can continue.

I couldn't find the file that breaks everything. Also, if I run a single thread for the queries, everything seems to be fine.

Code to reproduce:
https://gist.github.com/zurk/2d9e786e6577ebe60e963091c13b4ecd

files.txt
The files are on science-3. I can download and attach them if you want.

SIMPLE_IDENTIFIER is not assigned in exception

Hi,
here is an example:

def a():
    env = None
    try:
        something()
    except Exception as e:
        print("Something wrong")
        raise e from None

and e does not get the SIMPLE_IDENTIFIER role.
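A possible explanation, stated as an assumption, is that in the Python 3 native AST the name bound by `except ... as e` is a plain string on the handler node, not a `Name` node. The sketch below checks this with the standard `ast` module on a shortened version of the example.

```python
import ast

source = (
    "try:\n"
    "    something()\n"
    "except Exception as e:\n"
    "    raise e from None\n"
)
handler = ast.parse(source).body[0].handlers[0]

# 'type' is a Name node, but the bound name is only a str attribute.
print(type(handler).__name__, type(handler.type).__name__, handler.type.id)
print(repr(handler.name), type(handler.name).__name__)
```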

Add the Endposition keys to the parser.go once the SDK parses them

Depends on: bblfsh/sdk#153

Change the Dockerfile to install pydetector from the local filesystem if some directory exists

This way we can avoid uploading new versions to PyPI just to test changes that affect the Python driver. It could work like this: if a certain directory exists in the local directory (it can be a symbolic link), the Dockerfile installs pydetector from it. If not, it downloads pydetector from PyPI.
