GithubHelp home page GithubHelp logo

Input musn't be finnicky about varbio HOT 21 OPEN

timdiels avatar timdiels commented on August 16, 2024
Input musn't be finnicky

from varbio.

Comments (21)

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 3, 2019, 17:02

Python's utf-8-sig encoding (to be used on open) may solve BOM problems, not sure whether it solves UTF16-LE though. dos2unix does convert UTF16-LE to BOM though.

But perhaps simply using YAML will support UTF16-LE etc out of the box, or at least provide helpful errors. \t as separator was a mistake, it should be replaced by something users can see when reviewing their input for errors.

It probably also helps to be explicit about rows, e.g. a likely format for a matrix is

[[null,   'col1', 'col2'],
 ['row1', 123.45, 123.45],
]

This format seems easiest for a user to spot errors in. Realistically the columns won't be aligned though, and perhaps we should have a header: ['col1', 'col2'] to avoid the initial null. YAML is also easier to create files for programatically.

E.g. this is harder to give a user-friendly error for:

[null, 'col1', 'col2',
 'row1', 123.45,
 'row2', 123.45,
 'row3', 123.45,
]

A column is missing but instead it is interpreted as:

[null, 'col1', 'col2',
 'row1', 123.45, 'row2',
 123.45, 'row3', 123.45,
]

and likely it will complain about 'row2' not being a float, which is more puzzling than 'row 1 should have 3 columns, it has 2'. Or worse we might not catch it at all and some random error is triggered in the middle of the program.

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 3, 2019, 17:05

Some input errors can be fixed with a FAQ that step by step helps users fix the problem, ideally though we rely more so on our error messages giving them the steps to fix the issue.

Note that people often start from xls or csv and that is actually the cause of UTF16 shenanigans and whitespace issues. The FAQ should cover how to go from xls to YAML.

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 3, 2019, 18:34

Sometimes integers are used as gene names, or the name starts with a number. This too should be possible; or the FAQ should mention to prefix those numbers with a letter. Sometimes pandas tries to be smart and loads a gene/row name or condition/column name as a number whereas we always want to treat them as strings. In YAML it's also possible that these will be loaded as numbers instead of as strings, we could convert these to str explicitly, similarly we could convert values explicitly to numbers; but better be explicit and force it upon the user to properly quote or not quote values as otherwise this automatic conversion might mask real errors (does it though?). Edit: we could convert with warning.

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 3, 2019, 18:35

Do we allow for an index name? If so then [['gene', 'col1', 'col2'],... would make sense instead of having a separate header var for it.

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 3, 2019, 18:44

Try each of the following input sets and see if something goes wrong, they caused trouble in coexpnetviz v2.0.0.dev2: look for the old_coexpnetviz_bugs dir, see the website wiki for the full path.

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 4, 2019, 11:09

open() can handle any newlines thanks to universal newline mode which is on by default. Newlines to our code will always appear as \n.

open() defaults to using the default encoding for the platform, we should specify an encoding explicitly for consistent behaviour across platforms. pyyaml does not deal with encoding, it only accepts a stream (a file object) or str. utf-8-sig turned out to be unnecessary and it still can only read utf8; in my test utf8 can read files with a BOM just fine, in fact utf-8-sig can cause trouble when writing to existing files, so avoid it unless you want to write with BOM.

Still, in the test battery open() and open(encoding='utf-8') will handle only utf8. By autodetecting it we can handle all reasonable encodings.

My test battery includes: unix line endings, dos line endings, dos and unix line endings mixed, whitespace (spaces/newlines), whitespace with tabs, utf16-le, utf16-le with bom, utf8, utf8 with bom, utf16-le with bom, utf16-le with bom nulchars and whitespace, nulchars (hex 00). Each contains a bit of yaml which I parse with yaml.load (pyyaml).

With the below final script, we correctly parsed: dos.yaml incorrect_line_endings.yaml latin1.yaml utf16_le_bom_dos.yaml utf16le_bom.yaml utf8_bom.yaml utf8.yaml whitespace_tabs.yaml whitespace.yaml. chardet does not support any utf16 without a BOM (it detects ascii), in fact even vim displays it as a garbled mess interpreting it as utf-8, so this is unlikely to form a problem in practice; technically utf16 without a bom is allowed but it's very difficult for a program to guess which encoding it is in without a bom unless it's utf-8.

Nulchars result in this error:

unacceptable character #x0000: control characters are not allowed
  in "nullchars.yaml", position 27

Simply add to the FAQ that in this example the 27th character in the file (not counting newlines) is not allowed in YAML files.

The script used for testing:

from yaml import load, CLoader
import sys
from pprint import pprint
from chardet import detect
for file_name in sys.argv[1:]:
    print(file_name)
    try:
        with open(file_name, 'rb') as f:
            encoding = detect(f.read())['encoding']
            print('Detected', encoding)
        with open(file_name, encoding=encoding) as f:
            pprint(load(f, CLoader))
    except Exception as ex:
        print(ex)
    print()

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 4, 2019, 11:20

What about error messages yielded by yaml? Are they friendly?

[['gene', 'col1', 'col2'],
 ['row1', 23.4, 12.4],
 ['row2', 56.8, 34.5],,
]

yields

while parsing a flow node
did not find expected node content
  in "invalid_yaml/valid.yaml", line 3, column 23
[['gene', 'col1', 'col2'],
 ['row1', 23,4, 12,4],
 ['row2', 56,8, 34,5
]

yields

while parsing a flow sequence
  in "invalid_yaml/valid.yaml", line 1, column 1
did not find expected ',' or ']'
  in "invalid_yaml/valid.yaml", line 5, column 1

While a user may not understand what a flow sequence is, errors point to the line and column at which the error happens, sometimes even with a helpful error such as not finding a ]. So by using yaml we do gain friendlier error messages.

More cases to test with the full application (these need to be handled by us, not yaml):

# using , instead of . as float separator
[['gene', 'col1', 'col2'],
 ['row1', 23,400, 12,400],
 ['row2', 56,800, 34,500],
]

# Numbers as col/row names
[['gene', 1, 2],
 [1, 23.4, 12.4],
 [2, 56.8, 34.5],
]

# Numeric strings as values
[['gene', 'col1', 'col2'],
 ['row1', '23.4', '12.4'],
 ['row2', '56.8', '34.5'],
]
# combined with incorrect float separator
[['gene', 'col1', 'col2'],
 ['row1', '23,400', '12,400'],
 ['row2', '56,800', '34,500'],
]

# Missing a column
[['gene', 'col1', 'col2'],
 ['row1', 23.4],
 ['row2', 56.8],
]

# Extra column
[['gene', 'col1', 'col2'],
 ['row1', 23.4, 12.4, 36.3],
 ['row2', 56.8, 34.5, 23.2],
]

# One row missing a column
[['gene', 'col1', 'col2'],
 ['row1', 23.4],
 ['row2', 56.8, 34.5],
]

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 4, 2019, 11:28

We could support includes with just a bit of code. Syntax would be:

baits: !include baits.yaml
...

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 4, 2019, 11:37

Alternatively to the CLI options, just hand a config file:

baits: !include mybaits.yaml
matrices:
  - !include matrix1.yaml
  - !include matrix2.yaml

or inlined

baits:
  - gene1
  - gene2
matrices:
  - name: matrix1  # must be unique
    data: [['gene', 'col1', 'col2'],
           ['row1', 1, 2]]
  - name: matrix2
    data: [['gene', 'col1', 'col2'],
           ['row1', 1, 2]]

as well as allow other options which are specified on the command line currently.

For the syntax we can simply refer to the YAML docs, the primer which is the short intro.

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 4, 2019, 11:38

CLI is handy too though so keep supporting it but don't test it much, low prio, drop for now if it eats up time.

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 5, 2019, 14:00

changed the description

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 5, 2019, 14:07

Summary todo of all above:

  • Implement the YAML format, dropping support for the tsv format
  • Always treat row/col names as str, always treat values as float. Warn if
    the yaml type doesn't match the expected type, but do continue.
  • Test with old_coexpnetviz_bugs and our own test set

Edit: no longer planning to do these:

  • FAQ: how to convert from xls or old format to YAML

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Apr 5, 2019, 16:34

Warning non-str row/col names should help with detecting the case where the user forgets either the header row or the column with the row names.

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Oct 1, 2020, 23:13

removed milestone

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Oct 2, 2020, 13:58

In coexpnetviz#3 we mentioned supporting just a cytoscape plugin as interface is sufficient, so all we need to worry about is an invalid baits, gene families or expression matrix file. We can pass those as paths to the cli with a json or yaml file as config, which points to the baits, ... files.

  • Write tests for valid/invalid xls, xlsx, ods, csv. Ensure they all show a nice error of some sorts that explains the user sufficiently how to fix the problem.
  • Allow numbers as header, or names which start with a number. I.e. always treat those as str.
  • Allow both integers and floats as expression values
  • Deprecate the tsv format, it's too tricky for users to get right and tricky for us to support all bom, encoding, line endings, ... and all the possible errors such as double tab, spaces instead of tab, ...

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Oct 2, 2020, 16:37

moved from coexpnetviz#7

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Oct 3, 2020, 19:23

For csv files we won't be too lenient, rather we'll show errors with line numbers when we notice any funny business. This means line endings (whitespace) are significant in csv files.

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Oct 3, 2020, 22:27

  • For expmat you'll have to deal with floats formatted as 1.2e-3, 1, -.2e4, 1,000.3e-2. Invalid floats such as 1,3 or 1.000,3.

  • Exp mats must have unique index and cols. Error should name the duplicates.

  • Baits need a robust format too: autodetect encoding, don't mind line endings, simply split() otherwise; meaning we don't support spaces in bait/gene names.

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Oct 3, 2020, 22:52

or maybe baits input will be yaml file only as that needs to be written from scratch anyway, but if we also offer a text field as baits input then we need to support plain text anyway

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Oct 4, 2020, 05:40

  • parse_csv should reference the file being parsed (parse_yaml already does so)
  • Exp mat should also mention its name in errors

from varbio.

timdiels avatar timdiels commented on August 16, 2024

In GitLab by @timdiels on Oct 24, 2020, 18:06

mentioned in commit 8db7a14

from varbio.

Related Issues (3)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.