deib-geco / pygmql

Python Library for data analysis based on GMQL

License: Apache License 2.0

genomics python big-data bedtools java anaconda pypi jupyter-notebook gmql binder

pygmql's Introduction

PyGMQL

API for interactively calling the GMQL Engine from Python
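
A minimal usage sketch (the calls below mirror snippets that appear in the examples and issues later on this page; treat it as illustrative rather than canonical):

import gmql as gl

# load one of the example datasets shipped with the library
d = gl.get_example_dataset("Example_Dataset_1")
# keep only the regions on chromosome 1
d = d.reg_select(d.chr == 'chr1')
# run the query and bring the result into memory
result = d.materialize()
print(result.regs.head())   # regions as a pandas DataFrame
print(result.meta.head())   # metadata as a pandas DataFrame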


Documentation

The documentation can be found at the following link: http://pygmql.readthedocs.io


Examples

The examples folder contains the notebooks and scripts needed to reproduce the analysis presented in the manuscript.

Docker image

If you want to run the examples provided in the examples folder, you can directly pull the PyGMQL Docker image:

docker pull gecopolimi/pygmql

You can run the docker instance using the following command:

docker run --rm \
           --name pygmql_instance \
           -p <port>:8888 \
           gecopolimi/pygmql

where <port> can be set to any free port number on your machine. This will start a Jupyter Lab server running at the address

http://localhost:<port>

Inside the docker you will find the example folder containing both notebooks and scripts.

Get in touch

You can ask questions or provide some feedback through our Gitter channel.

Join the chat at https://gitter.im/DEIB-GECO/PyGMQL

Requirements

The library requires the following:

  • Python 3.4+
  • The latest version of JAVA installed
  • The JAVA_HOME variable set to the Java installation folder (example: C:\Program Files\Java\jdk1.8.0_161 or ~/jdk1.8.0_161)

Installation

From github

First of all, clone this repository to a location of your choice:

git clone https://github.com/DEIB-GECO/PyGMQL.git

Then go inside the library folder and install the package as follows:

cd PyGMQL
pip install -e .

From PyPI

pip install gmql

Setup

Use Anaconda

We suggest managing your Python distribution through Anaconda. The latest version can be downloaded from https://www.continuum.io/downloads.

Once your Anaconda distribution is installed, let's create a brand new environment:

conda create --name pygmql python=3

Check if JAVA is installed

Check that the JAVA_HOME environment variable is correctly set to the latest JAVA distribution.

echo $JAVA_HOME

If the variable is not set (the previous command prints nothing), you may need to install JAVA (https://www.java.com/it/download/) and then set JAVA_HOME as follows:

On Linux:

echo export "JAVA_HOME=/path/to/java" >> ~/.bash_profile
source ~/.bash_profile

On Mac:

echo export "JAVA_HOME=\$(/usr/libexec/java_home)" >> ~/.bash_profile
source ~/.bash_profile

On Windows:

  1. Right click My Computer and select Properties.
  2. On the Advanced tab, select Environment Variables, and then edit JAVA_HOME to point to where the JDK software is located, for example, C:\Program Files\Java\jdk1.6.0_02.

Use it in Jupyter Notebooks

We strongly suggest using the library within a Jupyter Notebook for the best graphical rendering of the data structures. It may be necessary to manually install the Jupyter kernel:

source activate pygmql
python -m ipykernel install --user --name pygmql --display-name "Python (pygmql)"

Keep the code updated

This is a constantly evolving project and new features are added regularly, so we suggest updating your local copy periodically:

cd PyGMQL
git pull

pygmql's People

Contributors

acanakoglu, anilbey, gitter-badger, lucananni93, pp86


pygmql's Issues

Code coverage

  1. All the operators must be covered by the testing procedure.
  2. Execution of example queries both locally and remotely, with a check of result consistency (see the sketch below)
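
A sketch of what such a consistency check could look like as a pytest test (the dataset name, the query, and the sort-based comparison criterion are assumptions for illustration):

import gmql as gl
import pandas as pd

def run_query(mode):
    """Run the same toy query in the given execution mode."""
    # remote mode additionally requires gl.set_remote_address(...) and gl.login()
    gl.set_mode(mode)
    d = gl.get_example_dataset("Example_Dataset_1")
    return d.reg_select(d.chr == 'chr1').materialize()

def test_local_remote_consistency():
    local = run_query("local").regs
    remote = run_query("remote").regs
    key = ["chr", "start", "stop"]
    # same regions must come back regardless of execution backend
    pd.testing.assert_frame_equal(
        local.sort_values(key).reset_index(drop=True),
        remote.sort_values(key).reset_index(drop=True))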

Conflict between pygmql and ElementTree package

Error when pygmql is imported together with ElementTree:

File "/home/leone/test/Cost_of_Fuel.py", line 8, in <module>
    import gmql as gl

  File "/home/leone/anaconda3/lib/python3.7/site-packages/gmql/__init__.py", line 31, in <module>
    __init()

  File "/home/leone/anaconda3/lib/python3.7/site-packages/gmql/__init__.py", line 27, in __init
    __init_managers()

  File "/home/leone/anaconda3/lib/python3.7/site-packages/gmql/managers.py", line 302, in init_managers
    __check_dependencies()

  File "/home/leone/anaconda3/lib/python3.7/site-packages/gmql/managers.py", line 288, in __check_dependencies
    __gmql_jar_path = __dependency_manager.resolve_dependencies()

  File "/home/leone/anaconda3/lib/python3.7/site-packages/gmql/FileManagment/DependencyManager.py", line 126, in resolve_dependencies
    resp = self._parse_dependency_info_fromstring(resp_text)

  File "/home/leone/anaconda3/lib/python3.7/site-packages/gmql/FileManagment/DependencyManager.py", line 75, in _parse_dependency_info_fromstring
    tree = ET.ElementTree(ET.fromstring(s))

  File "/home/leone/anaconda3/lib/python3.7/xml/etree/ElementTree.py", line 1315, in XML
    parser.feed(text)

  File "<string>", line unknown
ParseError: mismatched tag: line 13, column 4

Not able to install

I am using an Ubuntu 16.04 system.
Java version: openjdk version "1.8.0_292"
echo $JAVA_HOME

/usr/bin/java
I created a separate conda environment with Python 3.6 and installed Jupyter Notebook.
Then, from a Jupyter notebook, I ran
!pip install gmql

It installed successfully.

Now when I try to import:

import gmql as gl

it shows this error:

File "<string>", line unknown ParseError: mismatched tag: line 13, column 4

What mistake am I making? I am a bit new to Python; please help.

Cover Parameters

  • For minAcc/maxAcc only "ANY" is accepted (not "any", "Any", ...). Make it case insensitive (see the sketch below).
  • type parameter: change "normal" to "cover" and make it the default value.
  • Complete the documentation for minAcc/maxAcc, specifying that the keywords ANY and ALL are also accepted.
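
A minimal sketch of the requested case-insensitive handling (the helper name _parse_acc is hypothetical, not part of the PyGMQL API):

def _parse_acc(value):
    """Accept an integer or the keywords ANY/ALL in any letter case."""
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        keyword = value.strip().upper()   # "any", "Any", "ANY" all normalize
        if keyword in ("ANY", "ALL"):
            return keyword
    raise ValueError("minAcc/maxAcc must be an integer or the keyword ANY/ALL")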

import gmql failed

I cloned the repository and tried to import gmql.
I tried it on CentOS 7, CentOS 8 and Ubuntu 18.04.
I also tried it with Python 3.7.4 and Python 3.6.8.
When I tried to run it I got:

$ echo $JAVA_HOME
/usr/java/default
$ which java
/usr/java/default/bin/java

$ pip list |grep gmql
gmql                          0.1.1
$ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gmql
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/miniconda3/lib/python3.7/site-packages/gmql/__init__.py", line 31, in <module>
    __init()
  File "/usr/local/miniconda3/lib/python3.7/site-packages/gmql/__init__.py", line 27, in __init
    __init_managers()
  File "/usr/local/miniconda3/lib/python3.7/site-packages/gmql/managers.py", line 302, in init_managers
    __check_dependencies()
  File "/usr/local/miniconda3/lib/python3.7/site-packages/gmql/managers.py", line 288, in __check_dependencies
    __gmql_jar_path = __dependency_manager.resolve_dependencies()
  File "/usr/local/miniconda3/lib/python3.7/site-packages/gmql/FileManagment/DependencyManager.py", line 126, in resolve_dependencies
    resp = self._parse_dependency_info_fromstring(resp_text)
  File "/usr/local/miniconda3/lib/python3.7/site-packages/gmql/FileManagment/DependencyManager.py", line 75, in _parse_dependency_info_fromstring
    tree = ET.ElementTree(ET.fromstring(s))
  File "/usr/local/miniconda3/lib/python3.7/xml/etree/ElementTree.py", line 1315, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: mismatched tag: line 12, column 4
>>>

I think something changed in the external environment used for initialization.
Please check it.
Best regards.

Error when a field of the dataset is called *index*

When a dataset has a region field called index, the following error occurs:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-15-c63d645cfc40> in <module>()
      1 all_h3k27ac_peaks_gmql = gl.from_pandas(all_h3k27ac_peaks, chr_name="chr", 
      2                                         start_name="start", stop_name="end",
----> 3                                         strand_name="strand").to_GMQLDataset()

~/software/PyGMQL/gmql/dataset/GDataframe.py in to_GMQLDataset(self, local_path, remote_path)
     75 
     76         if local is not None:
---> 77             return Loader.load_from_path(local_path=local)
     78         elif remote is not None:
     79             raise NotImplementedError("The remote loading is not implemented yet!")

~/software/PyGMQL/gmql/dataset/loaders/Loader.py in load_from_path(local_path, parser, all_load)
     89                                        location="local", path_or_name=local_path,
     90                                        local_sources=local_sources,
---> 91                                        meta_profile=meta_profile)
     92 
     93 

~/software/PyGMQL/gmql/dataset/GMQLDataset.py in __init__(self, parser, index, location, path_or_name, local_sources, remote_sources, meta_profile)
     56         # setting the schema as properties of the dataset
     57         for field in self.schema:
---> 58             self.__setattr__(field, self.RegField(field))
     59         # add also left and right
     60         self.left = self.RegField("left")

~/software/PyGMQL/gmql/dataset/GMQLDataset.py in RegField(self, name)
    127         :return: a RegField instance
    128         """
--> 129         return RegField(name=name, index=self.index)
    130 
    131     def select(self, meta_predicate=None, region_predicate=None,

~/software/PyGMQL/gmql/dataset/DataStructures/RegField.py in __init__(self, name, index, region_condition, reNode)
     12         self.region_condition = region_condition
     13         pymg = get_python_manager()
---> 14         self.exp_build = pymg.getNewExpressionBuilder(self.index)
     15         # check that the name is not already complex
     16         if not (name.startswith("(") and name.endswith(")")) and reNode is None:

~/anaconda3/envs/bio/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1149 
   1150     def __call__(self, *args):
-> 1151         args_command, temp_args = self._build_args(*args)
   1152 
   1153         command = proto.CALL_COMMAND_NAME +\

~/anaconda3/envs/bio/lib/python3.6/site-packages/py4j/java_gateway.py in _build_args(self, *args)
   1119 
   1120         args_command = "".join(
-> 1121             [get_command_part(arg, self.pool) for arg in new_args])
   1122 
   1123         return args_command, temp_args

~/anaconda3/envs/bio/lib/python3.6/site-packages/py4j/java_gateway.py in <listcomp>(.0)
   1119 
   1120         args_command = "".join(
-> 1121             [get_command_part(arg, self.pool) for arg in new_args])
   1122 
   1123         return args_command, temp_args

~/anaconda3/envs/bio/lib/python3.6/site-packages/py4j/protocol.py in get_command_part(parameter, python_proxy_pool)
    288             command_part += ";" + interface
    289     else:
--> 290         command_part = REFERENCE_TYPE + parameter._get_object_id()
    291 
    292     command_part += "\n"

AttributeError: 'RegField' object has no attribute '_get_object_id'

Bug: semi-join operation in meta_select()

I obtained the following error (ValueError: Code 406. Dataset not found: null) while materializing the result of the query below (I attach both the notebook and the .py script). The issue is probably related to the semi-join operation, which does not filter the samples properly and seems to return a null dataset.

Query.zip

Add support for GSEA

Using gseapy, enable the extraction of gene sets from the data structures of the system and give the user routines for enrichment analysis.
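
A rough sketch of what the requested integration could look like, assuming the gene names live in a region field of a GDataframe (the field name "gene_symbol", the gene-set library, and the enrich helper are illustrative assumptions, not PyGMQL API):

import gseapy

def enrich(gdataframe, gene_field="gene_symbol", gene_sets="KEGG_2016"):
    """Extract a gene list from a PyGMQL result and run Enrichr on it."""
    genes = gdataframe.regs[gene_field].dropna().unique().tolist()
    # outdir=None keeps the results in memory instead of writing files
    enr = gseapy.enrichr(gene_list=genes, gene_sets=gene_sets, outdir=None)
    return enr.results   # a pandas DataFrame of enriched terms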

Datasets in GTF format make the query fail in remote mode

When a dataset in GTF format is used in a remote query, the following happens.

Example query:

import gmql as gl
gl.set_remote_address("http://gmql.eu/gmql-rest/")
gl.login()
gl.set_mode("remote")
d1 = gl.load_from_remote("Example_Dataset_1", owner="public")
r = d1.materialize()

The following exception is raised:

Traceback (most recent call last):
  File "C:/Users/lucan/Documents/progetti_phd/PyGMQL/test/test_map.py", line 8, in <module>
    r = d1.materialize()
  File "C:\Users\lucan\Documents\progetti_phd\PyGMQL\gmql\dataset\GMQLDataset.py", line 1191, in materialize
    return Materializations.materialize_remote(new_index, output_name, output_path, all_load)
  File "C:\Users\lucan\Documents\progetti_phd\PyGMQL\gmql\dataset\loaders\Materializations.py", line 83, in materialize_remote
    result = remote_manager.execute_remote_all(output_path=download_path)
  File "C:\Users\lucan\Documents\progetti_phd\PyGMQL\gmql\RemoteConnection\RemoteManager.py", line 520, in execute_remote_all
    return self._execute_dag(serialized_dag, output, output_path)
  File "C:\Users\lucan\Documents\progetti_phd\PyGMQL\gmql\RemoteConnection\RemoteManager.py", line 553, in _execute_dag
    self.download_dataset(dataset_name=name, local_path=path)
  File "C:\Users\lucan\Documents\progetti_phd\PyGMQL\gmql\RemoteConnection\RemoteManager.py", line 379, in download_dataset
    return self.download_as_stream(dataset_name, local_path)
  File "C:\Users\lucan\Documents\progetti_phd\PyGMQL\gmql\RemoteConnection\RemoteManager.py", line 402, in download_as_stream
    samples = self.get_dataset_samples(dataset_name)
  File "C:\Users\lucan\Documents\progetti_phd\PyGMQL\gmql\RemoteConnection\RemoteManager.py", line 226, in get_dataset_samples
    return self.process_info_list(res, "info")
  File "C:\Users\lucan\Documents\progetti_phd\PyGMQL\gmql\RemoteConnection\RemoteManager.py", line 188, in process_info_list
    res = pd.concat([res, pd.DataFrame.from_dict(res[info_column].map(extract_infos).tolist())], axis=1)\
  File "C:\Users\lucan\Anaconda3\envs\bio\lib\site-packages\pandas\core\frame.py", line 2139, in __getitem__
    return self._getitem_column(key)
  File "C:\Users\lucan\Anaconda3\envs\bio\lib\site-packages\pandas\core\frame.py", line 2146, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Users\lucan\Anaconda3\envs\bio\lib\site-packages\pandas\core\generic.py", line 1842, in _get_item_cache
    values = self._data.get(item)
  File "C:\Users\lucan\Anaconda3\envs\bio\lib\site-packages\pandas\core\internals.py", line 3843, in get
    loc = self.items.get_loc(item)
  File "C:\Users\lucan\Anaconda3\envs\bio\lib\site-packages\pandas\core\indexes\base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'info'

Process finished with exit code 1

This is the exception raised by the GMQL server (netty log):

2018-03-16 11:20:38,557 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 - 18/03/16 11:20:38 ERROR GMQLSparkExecutor: empty.reduceLeft
2018-03-16 11:20:38,557 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 - java.lang.UnsupportedOperationException: empty.reduceLeft
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at scala.collection.AbstractTraversable.reduceLeft(Traversable.scala:104)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at scala.collection.AbstractTraversable.reduce(Traversable.scala:104)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at it.polimi.genomics.profiling.Profilers.Profiler$.profile(Profiler.scala:147)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at it.polimi.genomics.spark.implementation.GMQLSparkExecutor$$anonfun$implementation$1.apply(GMQLSparkExecutor.scala:144)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at it.polimi.genomics.spark.implementation.GMQLSparkExecutor$$anonfun$implementation$1.apply(GMQLSparkExecutor.scala:112)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:73)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at it.polimi.genomics.spark.implementation.GMQLSparkExecutor.implementation(GMQLSparkExecutor.scala:112)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at it.polimi.genomics.spark.implementation.GMQLSparkExecutor.go(GMQLSparkExecutor.scala:59)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at it.polimi.genomics.GMQLServer.GmqlServer.run(GmqlServer.scala:23)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at it.polimi.genomics.cli.GMQLExecuteCommand$.main(GMQLExecuteCommand.scala:265)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at it.polimi.genomics.cli.GMQLExecuteCommand.main(GMQLExecuteCommand.scala)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2018-03-16 11:20:38,558 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at java.lang.reflect.Method.invoke(Method.java:498)
2018-03-16 11:20:38,559 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
2018-03-16 11:20:38,559 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
2018-03-16 11:20:38,559 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
2018-03-16 11:20:38,559 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
2018-03-16 11:20:38,559 [INFO] from org.apache.spark.launcher.app.GMQLExecuteCommand in launcher-proc-17 -      at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Can't run even the example notebooks

Hi! I've been trying to get into this API, but I keep getting stuck here :(

I keep getting errors in the Docker image you provided in the Readme.

An error occurred while calling z:it.polimi.genomics.pythonapi.PythonManager.take.
: java.lang.IllegalArgumentException: System memory 466092032 must be at least 471859200. Please increase heap size using the --driver-memory option or spark.driver.memory in Spark configuration.

And I can't find any way to fix it :(

And when I try to run things locally, with all dependencies resolved, in 03a_GWAS_Local I keep getting this error :(


Py4JJavaError                             Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 gwas.head().regs

~/anaconda3/lib/python3.7/site-packages/gmql/dataset/GMQLDataset.py in head(self, n)
   1400         current_mode = get_mode()
   1401         new_index = self.__modify_dag(current_mode)
-> 1402         collected = self.pmg.take(new_index, n)
   1403         regs = MemoryLoader.load_regions(collected)
   1404         meta = MemoryLoader.load_metadata(collected)

~/anaconda3/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1284         answer = self.gateway_client.send_command(command)
   1285         return_value = get_return_value(
-> 1286             answer, self.gateway_client, self.target_id, self.name)
   1287
   1288         for temp_arg in temp_args:

~/anaconda3/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:it.polimi.genomics.pythonapi.PythonManager.take.
: java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
at org.apache.hadoop.security.Groups.<init>(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2430)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2430)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2430)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:295)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at it.polimi.genomics.pythonapi.PythonManager$.startSparkContext(PythonManager.scala:394)
at it.polimi.genomics.pythonapi.PythonManager$.checkSparkContext(PythonManager.scala:387)
at it.polimi.genomics.pythonapi.PythonManager$.take(PythonManager.scala:340)
at it.polimi.genomics.pythonapi.PythonManager.take(PythonManager.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.base/java.lang.Thread.run(Thread.java:835)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3410)
at java.base/java.lang.String.substring(String.java:1883)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:50)
... 31 more

Could anybody help me, please?

GenometricSpace Visualizer

Provide a way to visualize and visually filter/inspect a GenoMetricSpace object. It must be compatible with Jupyter notebooks.

Coherency between the folder structure in local and remote execution

------------ Comment added by Luca for clarity (PLEASE ADD BETTER DESCRIPTION NEXT TIME) ------------

When a query is performed locally with a statement like the following

result = dataset.materialize("/path/to/result/")

The results are stored in the /path/to/result/ path with the following structure:

/path/to/result/
    exp/
        S_00000.gdm
        S_00000.gdm.meta
        ...

while when downloading the results of a query done using the web interface the result structure is:

/path/to/result/
    info.txt
    query.txt
    vocabulary.txt
    files/
        S_00000.gdm
        S_00000.gdm.meta
        ...

and finally, when downloading the results of a remote query using the library the structure is the following:

/path/to/result/
    S_00000.gdm
    S_00000.gdm.meta
    ...

The folder structure should be coherent across all the ways of downloading or generating datasets.

Removal of wrappers for Machine Learning libraries

In the Machine Learning module, remove all the wrapper functions around scikit-learn, Theano, matplotlib, etc.

On the other hand, a coherent set of transformation procedures (for example, to X, y matrices/vectors) must be provided, as sketched below.
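
A sketch of one such transformation, under the assumption that regions carry a sample identifier column (all column names here are hypothetical, not the current GDataframe layout):

import pandas as pd

def to_X_y(regs: pd.DataFrame, meta: pd.DataFrame, value_field: str, label_attr: str):
    """Turn a region table into a (samples x regions) matrix X and labels y."""
    X = regs.pivot_table(index="sample_id",                 # hypothetical sample column
                         columns=["chr", "start", "stop"],  # one column per region
                         values=value_field)
    y = meta.loc[X.index, label_attr]                       # align labels to X's rows
    return X, y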

Importing a system-wide installed module attempts to write into privileged locations

As shown:

import gmql as gl
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-6-775535dd745b> in <module>()
----> 1 import gmql as gl
      2 

/usr/local/lib/python3.6/site-packages/gmql/__init__.py in <module>()
     13 from .managers import login, logout, get_remote_address, get_session_manager
     14 
---> 15 __init_settings()
     16 __init_managers()

/usr/local/lib/python3.6/site-packages/gmql/settings.py in init_settings()
    112     global __version__, __folders
    113     __version__ = get_version()
--> 114     __folders = TempFileManager.initialize_tmp_folders()
    115     initialize_user_folder()
    116 

/usr/local/lib/python3.6/site-packages/gmql/FileManagment/TempFileManager.py in initialize_tmp_folders()
     19     for tf in tmp_folders:
     20         if not os.path.isdir(tf):
---> 21             os.mkdir(tf)
     22     result = {
     23         'tmp': tmp_folder_name,

PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.6/site-packages/gmql/resources/tmp'

On Linux, this should use the user's $HOME or (better) the directories from the XDG Base Directory specification (in this case ~/.local/share/gmql). Alternatively, a way to specify where to generate those temporary directories would be best.
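
A sketch of the suggested fix, following the XDG Base Directory specification with a $HOME fallback (the GMQL_TMP override and the exact paths are illustrative assumptions, not current PyGMQL behavior):

import os

def default_tmp_root():
    """Pick a user-writable directory for PyGMQL temporary files."""
    override = os.environ.get("GMQL_TMP")     # hypothetical explicit override
    if override:
        return override
    # XDG_DATA_HOME if set, otherwise the spec's default ~/.local/share
    base = os.environ.get("XDG_DATA_HOME") or os.path.join(
        os.path.expanduser("~"), ".local", "share")
    return os.path.join(base, "gmql", "tmp")

os.makedirs(default_tmp_root(), exist_ok=True)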

Multiple materializations in one statement

Creation of a gmql.materialize(datasets) which enables the execution of multiple materializations, both in local and remote mode.

datasets can be:

  • a list of GMQLDataset: in this case, the results are loaded directly into memory and a list of GDataframe is returned, in the same order as the input
  • a dictionary of the type { output_path: GMQLDataset }: in this case, the results are saved on disk at the locations specified by the dictionary keys and a dictionary of the type { output_path: GDataframe } is returned (see the sketch below)

The execution of the query must be unified for all the materializations.
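
A sketch of the proposed call under the semantics described above (gmql.materialize does not exist yet; the dataset names and output paths are illustrative):

import gmql as gl

dataset1 = gl.get_example_dataset("Example_Dataset_1")
dataset2 = gl.get_example_dataset("Example_Dataset_2")  # hypothetical second dataset

# list input: one GDataframe per dataset, in the same order as the input
gdf1, gdf2 = gl.materialize([dataset1, dataset2])

# dictionary input: results written to disk, returned keyed by output path
results = gl.materialize({"/out/first/": dataset1,
                          "/out/second/": dataset2})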

Change the implementation of the metadata structure

At the moment, the metadata structure in Pandas is implemented as a DataFrame of lists. This makes the selection and filtering of samples based on metadata pretty difficult and verbose.

Change that needs to be done:

  • Encapsulate the Python list of metadata values in an object (called, for example, MetaCell); see the sketch below
  • Overload the operators in order to make the syntax coherent with GMQL
  • Overload the string conversion (toString-equivalent) for good visualization in Pandas
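
A minimal sketch of the proposed wrapper (the MetaCell name comes from the proposal above; the specific overloads are illustrative assumptions):

class MetaCell:
    """Wraps the list of values a metadata attribute takes in one sample."""
    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # GMQL-style semantics: true if any of the stored values matches
        return any(v == other for v in self.values)

    def __repr__(self):
        # compact rendering inside a pandas DataFrame
        return ", ".join(map(str, self.values))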

The result of an ORDER operation returns a "_group" metadata attribute instead of "_order"

Query:

# order_1
D = SELECT(region: chr == chr1) Example_Dataset_1;
D1 = EXTEND(Region_count AS COUNT()) D;
RES = ORDER(Region_count DESC; meta_top: 2) D1;
MATERIALIZE RES INTO order_1;

Code:

res = gl.get_example_dataset("Example_Dataset_1")
res = res.reg_select(res.chr == 'chr1')
res = res.extend({'Region_count': gl.COUNT()})
res = res.order(meta=['Region_count'], meta_ascending=[False], meta_top="top", meta_k=2)
res = res.materialize()

Wrong metadata after LOADing a MATERIALIZEd file

After loading a dataset which was previously materialized, metadata appear as strings of (seemingly) random characters. The first impression is that they are hashed in some way and not dehashed when materialized again.

Example query:
import gmql as pygmql

# NARROW_PATH points to ENCODE_NARROW_AUG_2017
enc_narrow_full = pygmql.load_from_path(
    local_path=NARROW_PATH,
    parser=pygmql.parsers.NarrowPeakParser())
test = enc_narrow_full[
    (enc_narrow_full['biosample_term_name'] == 'HepG2')
    & (enc_narrow_full['assay'] == 'ChIP-seq')
    & (enc_narrow_full['experiment_target'] == 'MYC-human')].materialize()
print(test.meta)
test2 = test.to_GMQLDataset()
file2 = test2.materialize()
print(file2.meta)

Error using COUNT aggregation function

Using the query below:
HM_TF_rep_good = HM_TF_rep_good_0.extend({'_Region_number' : gl.COUNT()})

gives the following error:


Py4JJavaError                             Traceback (most recent call last)
<ipython-input-...> in <module>()
      1 #Count the regions in each sample (with the function COUNT()) and add the value in the metadata
----> 2 HM_TF_rep_good = HM_TF_rep_good_0.extend({'_Region_number' : gl.COUNT()})

/Users/eirinistamoulakatou/PyGMQL/gmql/dataset/GMQLDataset.py in extend(self, new_attr_dict)
    446                 op_name = item.get_aggregate_name()
    447                 op_argument = item.get_argument()
--> 448                 regsToMeta = expBuild.getRegionsToMeta(op_name, k, op_argument)
    449                 aggregates.append(regsToMeta)
    450             else:

/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1152         answer = self.gateway_client.send_command(command)
   1153         return_value = get_return_value(
-> 1154             answer, self.gateway_client, self.target_id, self.name)
   1155
   1156         for temp_arg in temp_args:

/anaconda/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    318                 raise Py4JJavaError(
    319                     "An error occurred while calling {0}{1}{2}.\n".
--> 320                     format(target_id, ".", name), value)
    321             else:
    322                 raise Py4JError(

Py4JJavaError: An error occurred while calling o283.getRegionsToMeta.
: java.lang.IllegalArgumentException: The field null does not exists!
at it.polimi.genomics.pythonapi.operators.ExpressionBuilder.getRegionsToMeta(ExpressionBuilder.scala:203)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.base/java.lang.Thread.run(Thread.java:844)

"INT is not a region builder" error in JOIN function

With this query:

chip_promoter_0 = promoter.join(experiment=histone_marks,
                         genometric_predicate=[gl.DL(0)],
                         output='INT')

I got the following error. The same query works if the output parameter is LEFT, RIGHT or CONTIG.


Py4JJavaError                             Traceback (most recent call last)
<ipython-input-107-5dbe907d20bc> in <module>()
      1 chip_promoter_0 = promoter.join(experiment=histone_marks,
      2                          genometric_predicate=[gl.DL(0)],
----> 3                          output='INT')

~/anaconda3/lib/python3.7/site-packages/gmql/dataset/GMQLDataset.py in join(self, experiment, genometric_predicate, output, joinBy, refName, expName, left_on, right_on)
    775 
    776         if isinstance(output, str):
--> 777             regionBuilder = self.opmng.getRegionBuilderJoin(output)
    778         else:
    779             raise TypeError("output must be a string. "

~/anaconda3/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

~/anaconda3/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o16493.getRegionBuilderJoin.
: java.lang.IllegalArgumentException: INT is not a region builder
	at it.polimi.genomics.pythonapi.operators.OperatorManager$.getRegionBuilderJoin(OperatorManager.scala:314)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

IndexError when MATERIALISE-ing after long query

PyGMQL cannot materialize the result of a query due to the following IndexError:

Traceback (most recent call last):
  File "TSS_query.py", line 188, in <module>
    both_strands_res.materialize(output_path=OUTFILE_PATH)
  File "/Users/perna/PycharmProjects/PhD/Theo_paper/twica/PyGMQL_code/PyGMQL/gmql/dataset/GMQLDataset.py", line 854, in materialize
    return Materializations.materialize_local(new_index, output_path)
  File "/Users/perna/PycharmProjects/PhD/Theo_paper/twica/PyGMQL_code/PyGMQL/gmql/dataset/loaders/Materializations.py", line 55, in materialize_local
    meta = MetaLoaderFile.load_meta_from_path(real_path)
  File "/Users/perna/PycharmProjects/PhD/Theo_paper/twica/PyGMQL_code/PyGMQL/gmql/dataset/loaders/MetaLoaderFile.py", line 26, in load_meta_from_path
    parsed.extend(list(map(lambda row: parser.parse_line_meta(key, row), lines)))  # [(id, (attr_name, value)),...]
  File "/Users/perna/PycharmProjects/PhD/Theo_paper/twica/PyGMQL_code/PyGMQL/gmql/dataset/loaders/MetaLoaderFile.py", line 26, in <lambda>
    parsed.extend(list(map(lambda row: parser.parse_line_meta(key, row), lines)))  # [(id, (attr_name, value)),...]
  File "/Users/perna/PycharmProjects/PhD/Theo_paper/twica/PyGMQL_code/PyGMQL/gmql/dataset/parsers/BedParser.py", line 92, in parse_line_meta
    return id_record, (elems[0], elems[1])
IndexError: list index out of range

The query is attached. The data used were downloaded from the GMQL repository. Of note: the query performs repeated MAPs and COVERs, and one of the metadata attributes is _provenance, the use of which is unclear (see example below).

_provenance
f_new_promoters.enhancers.hm_4_proms.antibody_accession ENCAB000ANK
f_new_promoters.enhancers.hm_4_proms.antibody_accession ENCAB000ARB
f_new_promoters.enhancers.hm_4_proms.antibody_accession ENCAB000BLJ
f_new_promoters.enhancers.hm_4_proms.assay ChIP-seq
...
TSS_query.py.zip

Set master with local[X]

I cannot run with a defined number of threads using gl.set_master("local[10]"). Please correct this.
