bblfsh / python-driver
License: GNU General Public License v3.0
The documentation says:
// Primitive is a language builtin.
Primitive
Python has a lot of built-in names, but since any built-in function can be overridden, it is not possible to say whether a given name is a built-in or not.
Can you please clarify what the Primitive role means for Python?
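To illustrate the ambiguity, here is a minimal sketch using only the standard builtins module. The helper names is_builtin_name and shadowed_call are made up for this example. A lexical check says len is a built-in, yet inside the function the same name refers to something else.

```python
import builtins


def is_builtin_name(name):
    """Purely lexical check: does `name` exist in the builtins module?"""
    return hasattr(builtins, name)


def shadowed_call():
    # A local binding shadows the built-in for the rest of this scope.
    len = lambda obj: -1
    return len([1, 2, 3])


lexical_answer = is_builtin_name("len")
runtime_answer = shadowed_call()
```

So a driver that only sees the syntax tree can at best say that a name matches a built-in name, not that it resolves to the built-in at run time.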
I run this code:
from bblfsh.client import BblfshClient
bc = BblfshClient("0.0.0.0:9432")
res = bc.parse("example.py", language="Python")
print(res)
where example.py is
import matplotlib.pyplot
Here is the full output:
uast {
internal_type: "Module"
children {
internal_type: "Import"
properties {
key: "internalRole"
value: "body"
}
children {
internal_type: "Import.names"
properties {
key: "promotedPropertyList"
value: "true"
}
children {
internal_type: "alias"
properties {
key: "asname"
value: "<nil>"
}
token: "matplotlib.pyplot"
start_position {
line: 1
col: 1
}
roles: IMPORT_PATH
roles: SIMPLE_IDENTIFIER
}
roles: 142
}
start_position {
line: 1
col: 1
}
roles: IMPORT_DECLARATION
roles: STATEMENT
}
start_position {
line: 1
col: 1
}
roles: FILE
}
Why is matplotlib not reported as a simple identifier on its own?
Is this output correct?
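For context, a small sketch of what CPython's own ast module reports for this statement. The helper name import_path_components is hypothetical. The native AST stores the whole dotted path as a single string in alias.name, which may be why the driver emits one token; splitting it into components would have to happen in the driver.

```python
import ast


def import_path_components(source):
    """Return the dotted-path components of every plain `import` statement."""
    components = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # alias.name holds the whole dotted path as one string
                components.append(alias.name.split("."))
    return components


parts = import_path_components("import matplotlib.pyplot")
```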
I found it in the output of this example https://gist.github.com/zurk/8f7dd974347925ae62c31d9441491613 for issue #94.
For the self token the IDENTIFIER role is duplicated:
<self>: ['IDENTIFIER', 'QUALIFIED', 'IDENTIFIER', 'EXPRESSION']
or for class1:
<class1>: ['IDENTIFIER', 'QUALIFIED', 'IDENTIFIER', 'EXPRESSION']
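A quick way to spot such nodes when scanning role lists, as a sketch. duplicated_roles is a hypothetical helper that works on plain lists of role names like the ones printed above.

```python
from collections import Counter


def duplicated_roles(roles):
    """Return the roles that occur more than once, in first-seen order."""
    counts = Counter(roles)
    return [role for role in dict.fromkeys(roles) if counts[role] > 1]


dups = duplicated_roles(["IDENTIFIER", "QUALIFIED", "IDENTIFIER", "EXPRESSION"])
```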
If you extract the UAST for
""""
you get the following tree:
# Token Internal Role Roles Tree
|| Module FILE
1 || Expr ┣ EXPRESSION
1 || Str ┗ ┗ BYTE, TUPLE, EXPRESSION
What I expect to see is:
# Token Internal Role Roles Tree
|| Module FILE
1 || Expr ┣ EXPRESSION
1 || Str ┗ ┗ ~BYTE~, ~TUPLE~, EXPRESSION, +STRING+, +VALUE+
Legend:
+ROLE+ -- add Role
~ROLE~ -- remove Role
?ROLE? -- maybe add/remove Role
Gist to generate UAST Roles visualization: https://gist.github.com/zurk/d314d67d9aac8843d3776c82cd738b40
If you extract the UAST for
{1:2}
you get the following tree:
# Token Internal Role Roles Tree
|| Module FILE
1 || Expr ┣ EXPRESSION
1 || Dict ┃ ┣ BYTE, NULL, EXPRESSION
1 |1| Num ┃ ┃ ┣ BYTE, REGEXP, EXPRESSION, NULL, PRIMITIVE
1 |2| Num ┗ ┗ ┗ BYTE, REGEXP, EXPRESSION, NULL, VALUE
What I expect to see is:
# Token Internal Role Roles Tree
|| Module FILE
1 || Expr ┣ EXPRESSION
1 || Dict ┃ ┣ ~BYTE~, ~NULL~, EXPRESSION, ?TYPE?
1 |1| Num ┃ ┃ ┣ ~BYTE~, ~REGEXP~, EXPRESSION, ~NULL~, ~PRIMITIVE~, +NUMBER+, +VALUE+
1 |2| Num ┗ ┗ ┗ ~BYTE~, ~REGEXP~, EXPRESSION, ~NULL~, +NUMBER+, +VALUE+
Legend:
+ROLE+ -- add Role
~ROLE~ -- remove Role
?ROLE? -- maybe add/remove Role
Gist to generate UAST Roles visualization: https://gist.github.com/zurk/d314d67d9aac8843d3776c82cd738b40
This code:
class Repo2nBOW(Repo2Base):
    @property
    def id2vec(self):
        return self._id2vec
produces an offset of 0 for _id2vec.
Using the latest https://gist.github.com/bzz/c0c3dbcab5fecbe48e22167e2ad78595, UAST parsing fails on what seems to be https://github.com/damoeb/kalipo/blob/master/kalipo-ir/harvester/spiders/heise_spider.py
Server log
time="2017-06-21T13:44:13Z" level=debug msg="sending ParseUAST request: Filename:"kalipo-ir/harvester/spiders/heise_spider.py" Language:"python" Content:"import scrapy\nfrom scrapy.contrib.spiders import CrawlSpider, Rule\nfrom scrapy.contrib.linkextractors import LinkExtractor\nfrom scrapy.selector import Selector\n\nfrom harvester.items import Comment\nimport time\nimport calendar\nimport re\n\nclass HeiseSpider(CrawlSpider):\n name = \"heise\"\n allowed_domains = [\"www.heise.de\"]\n start_urls = [\n \"http://www.heise.de/forum/Telepolis/Kommentare/Ohne-Vorratsdatenspeicherung-sterben-vermisste-Kinder-und-Suizidale/forum-242979/\"\n ]\n\n rules = (\n #Rule(LinkExtractor(allow=('/tp/foren/[^/]+/forum-[0-9]+/list'))),\n\tRule(LinkExtractor(allow=('/posting-[0-9]+/show')), callback='parse_item')\n )\n\n def clean_str(self, val):\n\treturn val.replace(u'\\xa0', u' ').strip()\n\n def to_str(self, arr):\n\treturn self.clean_str(''.join(arr))\n\n def parse_date(self, val):\n\tgrps = re.search('[0-9]+\\. ([A-Za-z]+) [0-9]{4} [0-9]{2}:[0-9]{2}', val)\n\n\tmnth = grps.group(1)\n\n \tmonths = ['Januar', 'Februar', 'M\\u00e4rz', 'April', 'Mai', 'Juni', 'Juli', 'August', 'September', 'Oktober', 'November', 'Dezember']\n\tfor index, item in enumerate(months):\n\t if item.lower() == mnth.lower():\n\t val = val.replace(mnth, str(index))\n\t break\n\n\treturn calendar.timegm(time.strptime(val, \"%d. 
%m %Y %H:%M\"))\n\n def parse_item(self, response):\n sel = Selector(response)\n\n\n\tisRoot = len(response.xpath(\"//ul[@class='forum_navi'][2]/li\")) == 6\n\n\tif !isRoot:\n\t # find parent\n\t parent = response.xpath(\"//span[@class='active_post']/../../../parent::ul[@class='nextlevel_line']/preceding-sibling::div[@class='hover_line']\")\n\t # get link\n\t link = parent.xpath(\".//div[@class='thread_title']/a\")\n\t # extract parent id from href\n\n\n\titem = Comment()\n\titem['text'] = self.to_str(sel.xpath(\"//h3[@class='posting_subject']/text()\").extract()) + self.to_str(sel.xpath(\"//p[@class='posting_text']/text()\").extract())\n\titem['url'] = response.url\n\titem['parent'] = 'unknown'\n\titem['level'] = 0\n\titem['thread'] = re.search('forum-([0-9]+)', response.url).group(1)\n\titem['author'] = self.to_str(sel.xpath(\"//div[@class='user_info']/i//text()\").extract())\n\titem['date'] = self.parse_date(self.to_str(response.xpath(\"//div[@class='posting_date']/text()\").extract()))\n return item\n\n" "
time="2017-06-21T13:35:14Z" level=error msg="driver bblfsh/python-driver:latest (01BK5BZ6N1S7MZBCSFPADDBFSW) stderr: ERROR:root:Filepath: , Errors: ['Traceback (most recent call last):\n File "/usr/lib/python3.6/site-packages/python_driver/requestprocessor.py", line 151, in process_request\n raise Exception(\'Could not determine Python version\')\nException: Could not determine Python version\n']"
Client logs
Read kalipo-ir/harvester/spiders/heise_spider.py, 2247 bytes Parsing file:'kalipo-ir/harvester/spiders/heise_spider.py'
Panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1186147]
goroutine 1 [running]:
github.com/bblfsh/sdk/uast.(*Node).ProtoSize(0x0, 0xc4201f9f50)
/go/src/github.com/bblfsh/sdk/uast/generated.pb.go:510 +0x37
github.com/bblfsh/sdk/uast.(*Node).Marshal(0x0, 0x1c18070, 0xc4200102c0, 0xc4201f9f20, 0x0, 0x0)
/go/src/github.com/bblfsh/sdk/uast/generated.pb.go:352 +0x2f
main.main.func1(0xc4201c4e60, 0xc4201c4e60, 0x0)
/go/src/github.com/src-d/analysis-pipeline/juanjo/pyFromGit2ast2pb.go:69 +0x4d5
All tokens on the same line have the same start_position.
Neither the start_position.offset field nor end_position is returned, although both are documented in https://doc.bblf.sh/uast/specification.html:
[...] it is guaranteed that nodes in a UAST either have no position attached or they have a position with valid offset, line and col. [...]
End position, if present in a token node, is the position of the last character of the token in the original source code.
example:
import sys
sys.stdout.write("Hello world!\n")
client: client-python
The same position is returned for the sys and stdout nodes on the second line:
start_position {
line: 2
col: 1
}
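For reference, here is a sketch of the offset that each of the two nodes would be expected to carry. It assumes 1-based lines and columns and a 0-based offset counted in characters, which is what the driver output quoted elsewhere on this page suggests (offset 24 paired with col 25). offset_from_line_col is a hypothetical helper, not part of any bblfsh API.

```python
def offset_from_line_col(source, line, col):
    """Convert a 1-based (line, col) pair to a 0-based character offset."""
    lines = source.splitlines(keepends=True)
    if not 1 <= line <= len(lines):
        raise ValueError("line out of bounds: %d" % line)
    if not 1 <= col <= len(lines[line - 1]):
        raise ValueError("column out of bounds: %d" % col)
    # Characters on all previous lines, plus the position within this line.
    return sum(len(text) for text in lines[: line - 1]) + (col - 1)


source = 'import sys\nsys.stdout.write("Hello world!\\n")\n'
sys_offset = offset_from_line_col(source, 2, 1)
stdout_offset = offset_from_line_col(source, 2, 5)
```

With distinct columns (1 for sys, 5 for stdout) the two nodes get distinct offsets, which is what the specification quoted above appears to require.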
If you run this code
import os
import sys
import time
import bblfsh
def func(n):
    print('Sleep during {} sec...'.format(n))
    time.sleep(n)
    return n
if __name__ == '__main__':
    for n in range(1, 3):
        pid = os.fork()
        if pid == 0:
            # if you do import bblfsh in this place everything will be fine
            res = func(n)
            print('Sleep in is Done!')
            sys.exit(0)
        else:
            print('os.waitpid is waiting for {}...'.format(pid))
            _, status = os.waitpid(pid, 0)
            print('os.waitpid is fine!')
    print('ok, terminate.')
You will see that the first child process hangs during sys.exit(0).
If you comment out import bblfsh, everything works.
You can also replace import bblfsh with
from bblfsh.github.com.bblfsh.sdk.protocol.generated_pb2 import ParseResponse
and get the same effect. And here is the main point: the problem is not exactly in bblfsh, but in grpc.
It seems that ParseResponse should be imported in the child process, because grpc starts threads during import, and if you call fork after that, the child process hangs on exit.
In grpc/grpc#7951 (comment) you can find that fork is not supported.
I am writing this here as an FYI; I think it is important to know.
It may not be a big deal to move the ParseRequest and ProtocolServiceStub imports in /bblfsh/client.py to the place where they are actually used. That would help to avoid this kind of problem. Or we could find a more elegant way. What do you think?
P.S.: tested on macOS and Ubuntu.
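The deferred-import idea could look roughly like the sketch below. It uses the standard json module as a stand-in for the heavy dependency, so it only shows the pattern; lazy_import and encode are hypothetical names, and whether this actually avoids the grpc hang is the suggestion made above, not something verified here.

```python
import importlib


def lazy_import(module_name):
    """Import `module_name` on first use instead of at module load time."""
    return importlib.import_module(module_name)


def encode(payload):
    # The import happens inside the call, not when this file is loaded,
    # so a process that forks before calling encode() never triggers it.
    json = lazy_import("json")  # stand-in for a dependency such as grpc
    return json.dumps(payload)


encoded = encode({"n": 1})
```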
The documentation says:
// Identifier is any form of identifier, used for variable names, functions, packages, etc.
Identifier
...
// Name is an identifier used to reference a value.
Name
But in Python all identifiers are references.
So, what is the difference between these two roles in Python?
If there is no difference, we should add the NAME role to each IDENTIFIER node.
I found that a QUALIFIED_IDENTIFIER is not tagged as SIMPLE_IDENTIFIER, but @vmarkovtsev says that it is supposed to be.
Also, I found a duplicated CALL_CALLEE role.
How to reproduce:
from bblfsh.client import BblfshClient
filepath = "./matplotlib_example.py"
bc = BblfshClient("0.0.0.0:9432")
res = bc.parse(filepath, language='Python')
print(res)
matplotlib_example.py:
from matplotlib import pyplot as plt
plt.figure()
Output (lines 76-83):
token: "figure"
start_position {
line: 2
col: 1
}
roles: CALL_CALLEE
roles: CALL_CALLEE
roles: QUALIFIED_IDENTIFIER
The problem is with the figure token: it is not tagged as a SIMPLE_IDENTIFIER, which means that function names are not taken into account during our machine learning analysis.
P.S.: Moved from bblfsh/bblfshd#82 because it is a Python-specific problem.
Hi,
I was playing with UASTs and found what I think is a bug.
Minimal reproducible example:
from os import path
import sys
import numpy as np
All SIMPLE_IDENTIFIERS: {'path': 1, 'numpy': 1, 'sys': 1}
As we can see, np and os are missing.
Some helper code to debug:
from collections import Counter
from ast2vec.bblfsh_roles import SIMPLE_IDENTIFIER
from ast2vec.repo2.base import Repo2Base
class Repo2IdModel:
    NAME = "Repo2IdModel"
class Repo2IdCounter(Repo2Base):
    """
    Print all SIMPLE_IDENTIFIERs (and counters) from repository
    """
    MODEL_CLASS = Repo2IdModel
    def collect_id_cnt(self, root, id_cnt):
        for ch in root.children:
            if SIMPLE_IDENTIFIER in ch.roles:
                id_cnt[ch.token] += 1
            self.collect_id_cnt(ch, id_cnt)
    def convert_uasts(self, file_uast_generator):
        for file_uast in file_uast_generator:
            print("-" * 20 + " " + str(file_uast.filepath))
            id_cnt = Counter()
            self.collect_id_cnt(file_uast.response.uast, id_cnt)
            print(id_cnt)
if __name__ == "__main__":
    repo = "test/imports/"
    c2v = Repo2IdCounter(linguist="path/to/enry", bblfsh_endpoint="0.0.0.0:9432")
    c2v.convert_repository(repo)
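For comparison, the names that the three import statements in the minimal example carry in CPython's native AST can be listed with the standard ast module. This is only a sketch with a hypothetical helper, imported_names. It shows that os sits in ImportFrom.module and np in alias.asname, both plain string properties rather than child nodes, which may explain why they are not picked up as identifiers.

```python
import ast


def imported_names(source):
    """Collect (module, name, alias) for each imported name in `source`."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.append((None, alias.name, alias.asname))
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                found.append((node.module, alias.name, alias.asname))
    return found


names = imported_names("from os import path\nimport sys\nimport numpy as np\n")
```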
This is proving to be a common error that prevents the Python AST module from parsing the source file, but it should be easily fixable using the reindent.py script/module included with the Python standard distribution.
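Assuming the error in question is inconsistent tab/space indentation (the TabError reported for heise_spider.py elsewhere on this page), the effect of normalizing can be sketched as follows. Note that this uses str.expandtabs as a simple stand-in, not reindent.py itself, and the helper name compiles is made up.

```python
def compiles(source):
    """Return True if `source` compiles as Python 3 code."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:  # TabError and IndentationError are subclasses
        return False


# Second line is indented with 8 spaces, third line with one tab.
mixed = "if True:\n        x = 1\n\ty = 2\n"
normalized = mixed.expandtabs(8)

before = compiles(mixed)
after = compiles(normalized)
```

Expanding tabs to 8 columns matches the traditional Python 2 interpretation; a real fix would need to make sure the chosen tab size reflects what the author intended.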
Hi,
I tried the new version of python-driver and found several errors.
Code in test.py:
from collections import defaultdict
Then launch the bblfsh client:
egor@egor-sourced:~/workspace/uast_playground$ python3 -m bblfsh -f test.py
uast {
internal_type: "Module"
children {
internal_type: "ImportFrom"
properties {
key: "internalRole"
value: "body"
}
properties {
key: "level"
value: "0"
}
children {
internal_type: "alias"
properties {
key: "asname"
value: "<nil>"
}
properties {
key: "internalRole"
value: "names"
}
token: "defaultdict"
start_position {
offset: 24
line: 1
col: 25
}
end_position {
offset: 34
line: 1
col: 35
}
roles: IMPORT_PATH
roles: SIMPLE_IDENTIFIER
}
children {
internal_type: "ImportFrom.module"
properties {
key: "promotedPropertyString"
value: "true"
}
token: "collections"
roles: IMPORT_PATH
roles: SIMPLE_IDENTIFIER
}
token: "collections"
start_position {
offset: 5
line: 1
col: 6
}
end_position {
offset: 34
line: 1
col: 35
}
roles: IMPORT_DECLARATION
roles: STATEMENT
}
start_position {
line: 1
col: 1
}
end_position {
offset: 34
line: 1
col: 35
}
roles: FILE
}
So the token collections appears twice (I think that is wrong):
internal_type: "ImportFrom.module"
properties {
key: "promotedPropertyString"
value: "true"
}
token: "collections"
roles: IMPORT_PATH
roles: SIMPLE_IDENTIFIER
internal_type: "ImportFrom"
properties {
key: "internalRole"
value: "body"
}
properties {
key: "level"
value: "0"
}
children {
internal_type: "alias"
properties {
key: "asname"
value: "<nil>"
}
properties {
key: "internalRole"
value: "names"
}
token: "defaultdict"
start_position {
offset: 24
line: 1
col: 25
}
end_position {
offset: 34
line: 1
col: 35
}
roles: IMPORT_PATH
roles: SIMPLE_IDENTIFIER
}
children {
internal_type: "ImportFrom.module"
properties {
key: "promotedPropertyString"
value: "true"
}
token: "collections"
roles: IMPORT_PATH
roles: SIMPLE_IDENTIFIER
}
token: "collections"
start_position {
offset: 5
line: 1
col: 6
}
end_position {
offset: 34
line: 1
col: 35
}
roles: IMPORT_DECLARATION
roles: STATEMENT
If you extract the UAST for
t = set()
t = {0,1}
you get the following tree:
# Token Internal Role Roles Tree
|| Module FILE
1 || Assign ┣ BINARY, THIS, EXPRESSION
1 |t| Name ┃ ┣ LEFT, IDENTIFIER, EXPRESSION
1 || Call ┃ ┣ FUNCTION, CALLEE, EXPRESSION, RIGHT
1 |set| Name ┃ ┗ ┗ CALLEE, POSITIONAL, IDENTIFIER, EXPRESSION
2 || Assign ┣ BINARY, THIS, EXPRESSION
2 |t| Name ┃ ┣ LEFT, IDENTIFIER, EXPRESSION
2 || Set ┃ ┣ BYTE, STRING, EXPRESSION, RIGHT
2 |0| Num ┃ ┃ ┣ BYTE, REGEXP, EXPRESSION
2 |1| Num ┗ ┗ ┗ BYTE, REGEXP, EXPRESSION
What I expect to see is:
# Token Internal Role Roles Tree
|| Module FILE
1 || Assign ┣ BINARY, ~THIS~, EXPRESSION, +Assignment+
1 |t| Name ┃ ┣ LEFT, IDENTIFIER, EXPRESSION
1 || Call ┃ ┣ FUNCTION, ~CALLEE~, EXPRESSION, RIGHT, +CALL+
1 |set| Name ┃ ┗ ┗ CALLEE, ~POSITIONAL~, IDENTIFIER, EXPRESSION, +Name+
2 || Assign ┣ BINARY, ~THIS~, EXPRESSION, +Assignment+
2 |t| Name ┃ ┣ LEFT, IDENTIFIER, EXPRESSION
2 || Set ┃ ┣ ~BYTE~, ~STRING~, EXPRESSION, RIGHT, +SET+, ?TYPE?
2 |0| Num ┃ ┃ ┣ ~BYTE~, ~REGEXP~, EXPRESSION, +NUMBER+, +VALUE+
2 |1| Num ┗ ┗ ┗ ~BYTE~, ~REGEXP~, EXPRESSION, +NUMBER+, +VALUE+
Legend:
+ROLE+ -- add Role
~ROLE~ -- remove Role
?ROLE? -- maybe add/remove Role
Gist to generate UAST Roles visualization: https://gist.github.com/zurk/d314d67d9aac8843d3776c82cd738b40
I tried to run the bblfsh Python client and it fails (change the endpoint if you need to):
from bblfsh import BblfshClient
BblfshClient("172.17.0.1:9432").parse('./TickType.py', language='Python', )
Here is an example file: TickType.py.zip
The output I get:
ERROR:root:Exception deserializing message!
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/grpc/_common.py", line 129, in _transform
return transformer(message)
google.protobuf.message.DecodeError: Error parsing message
Traceback (most recent call last):
File "bug_exapmle.py", line 3, in <module>
BblfshClient("172.17.0.1:9432").parse('./temp/TickType.py', language='Python', )
File "/usr/local/lib/python3.5/dist-packages/bblfsh/client.py", line 58, in parse
response = self._stub.Parse(request, timeout=timeout)
File "/usr/local/lib/python3.5/dist-packages/grpc/_channel.py", line 507, in __call__
return _end_unary_response_blocking(state, call, False, deadline)
File "/usr/local/lib/python3.5/dist-packages/grpc/_channel.py", line 455, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.INTERNAL, Exception deserializing response!)>
The bblfsh server does not fail on this example if I run it directly.
It seems to be a problem in the grpc module...
Hi,
I tried to extract the UAST from a Python file and got an unexpected result: the result of the extraction starts with an error message.
The content of exp.py:
# Here you can find the code of /app/exp.py:
# find x in range (-2, 2) that should minimize math.sin function
import math
from hyperopt import fmin, tpe, hp
from hyperopt.mongoexp import MongoTrials
mongodb = "mongodb"
db_port = "27017"
db_name = "ml_exp"
exp_key = "exp1"
trials = MongoTrials("mongo://%s:%s/%s/jobs" % (mongodb, db_port, db_name), exp_key="exp1")
best = fmin(math.sin, hp.uniform("x", -2, 2), trials=trials, algo=tpe.suggest, max_evals=10)
print("Result:", best)
and the result of the extraction:
status: ERROR
errors: "column out of bounds: 0 [1, 1]"
uast {
internal_type: "Module"
...
Full UAST is attached:
exp.uast.txt
P.S.: Is it correct that we have several lines of roles?
roles: STRING_LITERAL
roles: EXPRESSION
roles: ASSIGNMENT_VALUE
If you extract the UAST for
f()
you get the following tree:
# Token Internal Role Roles Tree
|| Module FILE
1 || Expr ┣ EXPRESSION
1 || Call ┃ ┣ FUNCTION, CALLEE, EXPRESSION
1 |f| Name ┗ ┗ ┗ CALLEE, POSITIONAL, IDENTIFIER, EXPRESSION
What I expect to see is:
# Token Internal Role Roles Tree
|| Module FILE
1 || Expr ┣ EXPRESSION
1 || Call ┃ ┣ FUNCTION, ~CALLEE~, EXPRESSION, +CALL+
1 |f| Name ┗ ┗ ┗ CALLEE, ~POSITIONAL~, IDENTIFIER, EXPRESSION,+Name+
Legend:
+ROLE+ -- add Role
~ROLE~ -- remove Role
?ROLE? -- maybe add/remove Role
Gist to generate UAST Roles visualization: https://gist.github.com/zurk/d314d67d9aac8843d3776c82cd738b40
If you extract the UAST for
(0,)
you get the following tree:
# Token Internal Role Roles Tree
|| Module FILE
1 || Expr ┣ EXPRESSION
1 || Tuple ┃ ┣ BYTE, TYPE, EXPRESSION
1 |0| Num ┗ ┗ ┗ BYTE, REGEXP, EXPRESSION
What I expect to see is:
# Token Internal Role Roles Tree
|| Module FILE
1 || Expr ┣ EXPRESSION
1 || Tuple ┃ ┣ ~BYTE~, ?TYPE?, EXPRESSION, +TUPLE+
1 |0| Num ┗ ┗ ┗ ~BYTE~, ~REGEXP~, EXPRESSION, +NUMBER+, +VALUE+
Legend:
+ROLE+ -- add Role
~ROLE~ -- remove Role
?ROLE? -- maybe add/remove Role
Gist to generate UAST Roles visualization: https://gist.github.com/zurk/d314d67d9aac8843d3776c82cd738b40
Some nodes (probably the ones originating from a property in the Python native AST, without location info) have the column set to 0, which is wrong. In the case of properties, they should take the column from their parent node.
This blocks fixing the offset with the TransformationUASTParser.
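The proposed behaviour can be sketched on a toy tree. The dict layout with "col" and "children" keys and the helper inherit_columns are assumptions made for this illustration; they are not the driver's actual node type.

```python
def inherit_columns(node, parent_col=None):
    """Replace a col of 0 with the parent's col, recursively (in place)."""
    if node.get("col", 0) == 0 and parent_col is not None:
        node["col"] = parent_col
    for child in node.get("children", []):
        # Children see the already-corrected column of this node.
        inherit_columns(child, node.get("col"))
    return node


tree = {"col": 6, "children": [{"col": 0, "children": [{"col": 0}]}, {"col": 11}]}
fixed = inherit_columns(tree)
```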
Hi,
I tested https://github.com/EgorBu/uast_playground on a new file:
python3 -m uast_playground repo2id_str -r test_data/exp2
I found identifiers with empty tokens.
https://gist.github.com/zurk/8f7dd974347925ae62c31d9441491613
Here you can find my example: run run_me.py to get the output for example.py.
I think identifiers should never have empty tokens.
Or am I wrong?
I found it in my visualization example:
https://gist.github.com/zurk/d314d67d9aac8843d3776c82cd738b40
8 || ┣ BINARY, THIS, EXPRESSION
8 |var1| ┃ ┣ LEFT, IDENTIFIER, EXPRESSION
7 || ┃ ┃ ┣ INCOMPLETE
7 |\n| ┃ ┃ ┗ ┗ DOCUMENTATION
8 |<nil>| ┃ ┗ BYTE, NUMBER, EXPRESSION, RIGHT
If you take a look at the code, you will see that <nil> actually corresponds to the None token.
Also, I doubt that None should have a NUMBER role; maybe NULL?
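As a point of comparison, CPython's native AST (Python 3.8 or later assumed here) represents None as a constant whose value is the None object, not as a number. A minimal check:

```python
import ast

# Right-hand side of the assignment, as parsed by CPython itself.
value = ast.parse("var1 = None").body[0].value

is_none_constant = isinstance(value, ast.Constant) and value.value is None
is_number = isinstance(value.value, (int, float, complex))
```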
If you get the UAST for
import lib1
lib1.lib2.lib3.var = None
you will get the following list of identifiers with roles:
<lib1>: ['IMPORT', 'PATHNAME', 'IDENTIFIER']
<var>: ['IDENTIFIER', 'EXPRESSION', 'LEFT']
<lib3>: ['IDENTIFIER', 'EXPRESSION']
<lib2>: ['IDENTIFIER', 'EXPRESSION']
<lib1>: ['IDENTIFIER', 'QUALIFIED', 'IDENTIFIER', 'EXPRESSION']
The QUALIFIED role is missing from the lib2 and lib3 identifiers.
You can use the code from issue #94 to reproduce the result; see this gist: https://gist.github.com/zurk/8f7dd974347925ae62c31d9441491613
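For reference, the native AST nests this target as a chain of Attribute nodes ending in a Name, so every component belongs to the same qualified name. A sketch that flattens the chain with the standard ast module (qualified_parts is a hypothetical helper):

```python
import ast


def qualified_parts(node):
    """Flatten a chain of ast.Attribute nodes into its dotted components."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
    return list(reversed(parts))


target = ast.parse("lib1.lib2.lib3.var = None").body[0].targets[0]
parts = qualified_parts(target)
```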
It is not published on GitHub, but it is the latest tag at https://hub.docker.com/r/bblfsh/python-driver/tags/
Here is an example from the dashboard:
As you can see, there are completely strange roles in bblfsh.
If I run version v0.8.2 using BBLFSH_DRIVER_IMAGES="python=docker://bblfsh/python-driver:v0.8.2", everything is fine.
Maybe it is an erroneous release...
For example, this file: https://github.com/damoeb/kalipo/blob/master/kalipo-ir/harvester/spiders/heise_spider.py produces a TabError/SyntaxError (Python 3/2) but the only error that the driver returns is "Could not determine Python version", which is true, but also incomplete; the original SyntaxError should be added to the driver response.
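One way the original parser error could be carried along is sketched below, using only the built-in compile(). The function name parse_errors and the message format are made up for this example and are not the driver's actual response format. The sample input reuses the if !isRoot: line from the file above.

```python
def parse_errors(source, filename="<unknown>"):
    """Return a list of error strings that keeps the original parser error."""
    try:
        compile(source, filename, "exec")
        return []
    except SyntaxError as exc:
        # type(exc).__name__ is e.g. "TabError" or "SyntaxError"
        return ["%s: %s (line %s)" % (type(exc).__name__, exc.msg, exc.lineno)]


errors = parse_errors("if !isRoot:\n    pass\n")
no_errors = parse_errors("x = 1\n")
```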
When an If does not have the usual form, such as a binary expression, the IfCondition role does not appear in the UAST. For an input like this:
if True:
    print(True)
or
if functionCallThatReturnsABoolean(){
    print(True)
}
the UAST roles we expect are something like this:
If{
    IfCondition,
    IfBody
}
and we get this:
If{
    IfBody
    expression/functionCall
}
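In CPython's native AST (Python 3.8 or later assumed), every If node has a test field regardless of what kind of expression the condition is, so the information needed for IfCondition is available in all three cases. A small sketch with a hypothetical helper, if_conditions:

```python
import ast


def if_conditions(source):
    """Return the node type of each `if` test expression in `source`."""
    return [
        type(node.test).__name__
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.If)
    ]


kinds = if_conditions(
    "if True:\n    pass\n"
    "if f():\n    pass\n"
    "if a < b:\n    pass\n"
)
```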
Code inside Python's f-strings has its position reported as line 1, column 1, regardless of the real position. Some of these (the line numbers) are fixed by the synchronized tokenizer, but the columns are not fixed in all cases.
If you extract the UAST for range(10), you get the following tree:
line# token Roles
|| FILE
1 || ┣ EXPRESSION
1 || ┃ ┣ FUNCTION, CALLEE, EXPRESSION
1 |10| ┃ ┃ ┣ EXPRESSION, FUNCTION, DECLARATION, ARGUMENT, NAME, IDENTIFIER, CALLEE, ARGUMENT, NOOP
1 |range| ┗ ┗ ┗ CALLEE, POSITIONAL, IDENTIFIER, EXPRESSION
What I expect to see is:
line# token Roles
|| FILE
1 || ┣ EXPRESSION
1 || ┃ ┣ FUNCTION, CALLEE, EXPRESSION
1 |10| ┃ ┃ ┣ NUMBER, EXPRESSION, ARGUMENT, NAME, IDENTIFIER, ARGUMENT, POSITIONAL, VALUE
1 |range| ┗ ┗ ┗ CALLEE, IDENTIFIER
I am not sure about the EXPRESSION role; it is too common.
Also, in another experiment I found that if you parse just 10, you get:
|| FILE
1 || ┣ EXPRESSION
1 |10| ┗ ┗ BYTE, REGEXP, EXPRESSION
What I expect to see is:
|| FILE
1 || ┣ EXPRESSION
1 |10| ┗ ┗ NUMBER, EXPRESSION, VALUE
BTW, what about the MODULE role? Each file in Python is considered a module, or am I wrong?
From:
Node end positions are not mandatory in the current spec if the native driver doesn't provide them, as happens with the Python driver, but it would be nice to have them in this driver.
CC @dpordomingo
I get a strange error when trying to get the UAST from this file: oo.py.zip
I run this code:
from bblfsh.client import BblfshClient
bc = BblfshClient("0.0.0.0:9432")
res = bc.parse("./oo.py", language='Python')
print(res)
Output:
status: FATAL
errors: "expected object of type map[string]interface{}, got: \"NoneLiteral\""
but the .py file seems to be correct, because I can run python3 ./oo.py without any problem.
Also, the bblfsh server is running in Docker and outputs only:
time="2017-08-08T20:34:55Z" level=info msg="parsing oo.py (34525 bytes)"
Hi,
Example:
with open(os.path.join(args.output, "row_vocab.txt"), "w") as out:
    out.write('\n'.join(chosen_words))
The UAST contains a node with an empty token and a wrong position (0,0):
internal_type: "With.items"
properties {
key: "promotedPropertyList"
value: "true"
}
children {
internal_type: "withitem"
children {
internal_type: "Name"
properties {
key: "ctx"
value: "Load"
}
properties {
key: "internalRole"
value: "context_expr"
}
token: "a"
start_position {
offset: 5
line: 1
col: 6
}
end_position {
offset: 5
line: 1
col: 6
}
roles: SIMPLE_IDENTIFIER
roles: EXPRESSION
}
children {
internal_type: "Name"
properties {
key: "ctx"
value: "Store"
}
properties {
key: "internalRole"
value: "optional_vars"
}
token: "b"
start_position {
offset: 10
line: 1
col: 11
}
end_position {
offset: 10
line: 1
col: 11
}
roles: SIMPLE_IDENTIFIER
roles: EXPRESSION
}
start_position {
line: 1
col: 1
}
end_position {
offset: 10
line: 1
col: 11
}
roles: SIMPLE_IDENTIFIER
roles: INCOMPLETE
}
roles: SIMPLE_IDENTIFIER
roles: EXPRESSION
roles: INCOMPLETE
I wrote a small tool for collecting statistics for number of nodes w.r.t. number of node roles in UASTs. It turned out that for my dataset there're some cases when no roles are assigned to a UAST node.
Repositories: /storage/timofei/repos
Extracted UASTs: /storage/timofei/uasts
Collected statistics: uasts_stat.txt
List of suspicious UASTs (csv file with columns: path to UAST, total number of nodes, number of nodes without roles): uasts_susp.txt
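The kind of statistic described above can be sketched as follows. This is not the original tool; the dict-based node layout with "roles" and "children" keys and the helper role_count_histogram are assumptions made for the illustration.

```python
from collections import Counter


def role_count_histogram(node, histogram=None):
    """Count nodes by how many roles they carry (key 0 means no roles)."""
    if histogram is None:
        histogram = Counter()
    histogram[len(node.get("roles", []))] += 1
    for child in node.get("children", []):
        role_count_histogram(child, histogram)
    return histogram


uast = {"roles": ["FILE"], "children": [
    {"roles": [], "children": []},
    {"roles": ["IDENTIFIER", "EXPRESSION"]},
]}
hist = role_count_histogram(uast)
```

A non-zero hist[0] marks a tree as suspicious in the sense used above.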
Sending an empty file to the Python driver produces a fatal error (which by definition stops the driver from processing more requests). This shouldn't happen, since empty files are common in Python (__init__.py); it should just produce an error and return an empty UAST.
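For what it is worth, CPython itself treats an empty source as a valid module with an empty body, so there is a natural empty result to return:

```python
import ast

# An empty string parses without error into a Module with no statements.
empty_module = ast.parse("")
body_length = len(empty_module.body)
```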
Merge bblfsh/python-client#38 to print errors
Then
git clone https://github.com/pallets/flask
python3 -m bblfsh -f flask/examples/minitwit/minitwit/minitwit.py >/dev/null
And you get an error from bblfsh:
column out of bounds: 63 [1, 51]
The file is parsed though.
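The message reads like a range check of a column against the length of its line. A sketch of such a check is below; the exact bounds the SDK uses are not known here, so treating "one past the last character" as the maximum is an assumption, and check_column is a hypothetical name.

```python
def check_column(line_text, col):
    """Return None if `col` is a valid 1-based column, else an error string."""
    # Assumption: the position just past the last character is still valid.
    max_col = len(line_text) + 1
    if 1 <= col <= max_col:
        return None
    return "column out of bounds: %d [1, %d]" % (col, max_col)


error = check_column("x" * 50, 63)
ok = check_column("x" * 50, 51)
```

Under that assumption, the error above would mean a node claims column 63 on a line that is only 50 characters long.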
Hi,
I tried to extract the UAST from Python code and noticed that when you define a function:
def a(b, c): ...
the node for this function has the roles 'FUNCTION_DECLARATION_BODY' and 'FUNCTION_DECLARATION_RECEIVER', but not 'SIMPLE_IDENTIFIER'.
In the documentation it's mentioned:
// SimpleIdentifier is the most basic form of identifier, used for variable
// names, functions, packages, etc.
I think that this node should have the 'SIMPLE_IDENTIFIER' role.
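A possible reason, shown with the standard ast module: in the native AST the function name is a plain string attribute of the FunctionDef node, not a separate Name node, so nothing identifier-like exists unless the driver creates it. This is a sketch of the native structure, not of the driver's code.

```python
import ast

func = ast.parse("def a(b, c): ...").body[0]

# The name is stored as a str on the FunctionDef node itself.
name_is_plain_string = isinstance(func.name, str)
# No ast.Name nodes exist anywhere below this definition.
name_nodes = [n.id for n in ast.walk(func) if isinstance(n, ast.Name)]
# Parameters are ast.arg nodes, again with the name as a string.
arg_names = [arg.arg for arg in func.args.args]
```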
If you get the UAST for len(x), the CALL node has no CALL_CALLEE child, but the documentation (http://godoc.org/github.com/bblfsh/sdk/uast#Role) says:
// Call is any call, whether it is a function, procedure, method or macro.
// In its simplest form, a call will have a single child with a function
// name (CallCallee).
So, as I understand it, any CALL should have a CALL_CALLEE child.
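The native AST does keep the callee separate from the arguments, in the func field of the Call node, so the information for a CALL_CALLEE child should be available. A minimal look at it with the standard ast module:

```python
import ast

call = ast.parse("len(x)").body[0].value  # the ast.Call node

callee = call.func.id                      # the function being called
arguments = [arg.id for arg in call.args]  # positional arguments
```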
In the planning meeting we decided to split the functionality of the current pydetector.astexport.py module: the retrieval of the unmodified native AST data structure (for the right Python version) stays in pydetector, and the visitor + noop extractor + position updater go into python-driver, reusing the data returned from pydetector to avoid parsing twice.
Hi,
I ran some experiments and found a bug: some SIMPLE_IDENTIFIER nodes have the same position.
Reproducible example:
a += b.c["Some val"] \
.d
And uast_playground gives us:
# New token 'b' at position (1, 6) has the same position as token 'd' at the same position. Skip new token.
a += b.c["Some val"] \
# Something wrong with token 'd' at pos (1, 6) - it's not equal to 'b' at this position in code
.d
It looks like this happens because of the line continuation, because for this code:
a += b.c["Some val"].d
everything works well.
BTW: it looks like d is higher in the UAST than b. Is that correct? It appears earlier when traversing the UAST.
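Both observations match what CPython's native AST reports (Python 3.8 or later assumed for end_lineno). The .d attribute access is the outermost node and its start position is the start of the whole expression, which is the same place where b starts. Only its end position is on the second line. A sketch:

```python
import ast

source = 'a += b.c["Some val"] \\\n.d\n'
outer = ast.parse(source).body[0].value  # the `.d` attribute access
inner = next(n for n in ast.walk(outer) if isinstance(n, ast.Name))  # `b`

# lineno is 1-based, col_offset is 0-based.
outer_start = (outer.lineno, outer.col_offset)
inner_start = (inner.lineno, inner.col_offset)
outer_end_line = outer.end_lineno
```

So the real position of the d token has to come from a tokenizer pass rather than from the node's start position.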
If you have this code:
a = b = c
var1 == var2 == var3
var4 < var5 < var6
and run UAST extraction, you get strange role assignments.
Please take a look at the code. https://gist.github.com/zurk/66a3045746287bdb5002c0812b94f611
Here is output (the same gist):
https://gist.github.com/zurk/66a3045746287bdb5002c0812b94f611#file-output
Comments for output:
//*[@roleIdentifier] :
<a>: ['LEFT', 'IDENTIFIER', 'EXPRESSION']
<b>: ['LEFT', 'IDENTIFIER', 'EXPRESSION']
<c>: ['RIGHT', 'IDENTIFIER', 'EXPRESSION']
<var2>: ['IDENTIFIER', 'EXPRESSION']
<var3>: ['IDENTIFIER', 'EXPRESSION']
<var1>: ['IDENTIFIER', 'EXPRESSION', 'EXPRESSION', 'BINARY', 'LEFT']
<var5>: ['IDENTIFIER', 'EXPRESSION']
<var6>: ['IDENTIFIER', 'EXPRESSION']
<var4>: ['IDENTIFIER', 'EXPRESSION', 'EXPRESSION', 'BINARY', 'LEFT']
I am not sure how it should be, but at least var3 and var6 are on the right side. :)
Why do we have 'EXPRESSION', 'BINARY' for var1 and var4? I think it should be just EXPRESSION, as for all the others; BINARY belongs to the upper level in the UAST tree.
Also, I am not sure that we can call the second and last expressions binary at all.
Maybe the first line of code can be considered as two binary expressions.
//*[@roleLeft] :
<a>: ['LEFT', 'IDENTIFIER', 'EXPRESSION']
<b>: ['LEFT', 'IDENTIFIER', 'EXPRESSION']
<var1>: ['IDENTIFIER', 'EXPRESSION', 'EXPRESSION', 'BINARY', 'LEFT']
<var4>: ['IDENTIFIER', 'EXPRESSION', 'EXPRESSION', 'BINARY', 'LEFT']
Ok, it can be true for 'a' and 'b'.
//*[@roleRight] :
<c>: ['RIGHT', 'IDENTIFIER', 'EXPRESSION']
<>: ['EXPRESSION', 'BINARY', 'RIGHT']
<>: ['EXPRESSION', 'BINARY', 'RIGHT']
Are the tokens var4 and var6 missing?
//*[@roleBinary] :
<>: ['BINARY', 'THIS', 'EXPRESSION']
<>: ['EXPRESSION', 'BINARY']
<>: ['EXPRESSION', 'BINARY', 'RIGHT']
<var1>: ['IDENTIFIER', 'EXPRESSION', 'EXPRESSION', 'BINARY', 'LEFT']
<>: ['EXPRESSION', 'BINARY', 'OPERATOR']
<==>: ['BINARY', 'OPERATOR', 'EQUAL']
<==>: ['BINARY', 'OPERATOR', 'EQUAL']
<>: ['EXPRESSION', 'BINARY']
<>: ['EXPRESSION', 'BINARY', 'RIGHT']
<var4>: ['IDENTIFIER', 'EXPRESSION', 'EXPRESSION', 'BINARY', 'LEFT']
<>: ['EXPRESSION', 'BINARY', 'OPERATOR']
<<>: ['BINARY', 'OPERATOR', 'LESS_THAN']
<<>: ['BINARY', 'OPERATOR', 'LESS_THAN']
Everything is fine for the a = b = c statement, except maybe the THIS role, but I am not sure. Please take a look at the definition and check whether it is suitable here.
And the roles for the last two lines of code are a mess.
<>: ['EXPRESSION', 'BINARY', 'OPERATOR']
It seems that this is the node for the full ternary operator, because it has no token. Not sure.
Hope this helps to investigate the problem.
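It may help to compare with how CPython's native AST groups these three statements. The sketch below uses the standard ast module and a hypothetical helper, describe. A chained comparison is a single Compare node with one left operand, a list of operators and a list of comparators, not a nest of binary expressions.

```python
import ast


def describe(statement):
    """Summarize how CPython's ast groups the operands of one statement."""
    node = ast.parse(statement).body[0]
    if isinstance(node, ast.Assign):
        return ("Assign", [t.id for t in node.targets], node.value.id)
    compare = node.value  # an ast.Expr wrapping an ast.Compare
    return (
        "Compare",
        compare.left.id,
        [type(op).__name__ for op in compare.ops],
        [c.id for c in compare.comparators],
    )


assign = describe("a = b = c")
equal = describe("var1 == var2 == var3")
less = describe("var4 < var5 < var6")
```

That would explain why only var1 and var4 get LEFT, while the remaining operands end up in a list with no obvious LEFT or RIGHT mapping.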
Related to bblfsh/bblfshd#101 (the problem was actually not in the server but in python-driver). I have similar symptoms with the new driver. At some point the server logs:
time="2017-09-22T15:41:39Z" level=debug msg="Empty code received, returning empty UAST"
And then nothing, although my program continues to send queries. After a while (about 30 seconds) the server logs
time="2017-09-22T15:42:11Z" level=debug msg="driver exited without error"
and then you can continue parsing.
I couldn't find the file which breaks everything; also, if I run a single thread for queries, everything seems to be fine.
Code to reproduce:
https://gist.github.com/zurk/2d9e786e6577ebe60e963091c13b4ecd
files.txt
The files are on science-3. I can download and attach them if you want.
Depends on: bblfsh/sdk#153
This way we can avoid uploading new versions to PyPI for testing changes affecting the Python driver. It could work this way: if a directory exists in the local directory (it can be a symbolic link), the driver would use it. If not, it would download from PyPI.