
d2txt's Issues

Add pathlib.Path support

Currently, methods that accept file objects or paths (e.g. D2TXT.load_txt()) do not support pathlib.Path objects, because they directly check if the path is a string with isinstance(inifile, str). Add support for Path objects, by trying open() first and catching a TypeError.

Use `dict` for column name lookup

As of writing, D2TXTRow stores values in a list ordered by column order. It searches a list of column names to determine the appropriate index when retrieving a cell by column name. This is inefficient (O(N)).

Solution: Use dict to speed up random access to each cell by column name.

There are two ways of doing this. One way is to use a master dict held by D2TXT that maps column names to indices. Another way is to make each D2TXTRow store values in a dict, and keep a master list held by D2TXT that remembers the order of column names.

  • Note: In Big-O notations, M = number of rows, N = number of columns.
  • Note: Column operations are not supported yet, but supporting them is a long-term goal.

Pros and cons of master dict, row list approach

  • Smaller memory footprint
  • Faster row creation
  • Faster sequential access when iterating a row -- O(1) in theory
    • Due to how collections.abc.Mapping is implemented (source code), sequential access requires a key lookup, which increases time cost to O(1) average, O(N) amortized worst case
  • Extra indirection (O(1)) when retrieving a cell by column name (master dict → row list)
  • Slower column insertion/deletion -- O(MN)
    • Also requires updating the master dict, which increases complexity
  • Slower column swap -- O(M)

Pros and cons of master list, row dict approach

  • Larger memory footprint
  • Slower row creation, since a new dict must be built each time
  • Slower sequential access when iterating a row -- O(1) average, O(N) amortized worst case
    • May be improved using OrderedDict, but this is incompatible with column insertion/deletion
  • Cell access by column name is immediate
  • Faster column insertion/deletion -- O(M) average, O(MN) amortized worst case
  • Faster column swap -- O(1)

Decision

Use master dict, row list approach. Row creations and iterations (including loading and saving TXT and INI files) are frequent, while column operations are relatively infrequent. Extra indirection when accessing cells by column name is acceptable.

To facilitate ease of use, D2TXT.__setitem__() should accept both sequences and mappings. A sequence is treated as a list of cell values ordered by column name, and can be used to quickly create a row. A mapping is treated as a collection of column name-cell value pairs, and can be used to intuitively create a row. Use isinstance() with Abstract Base Classes to check if the given object is a sequence or a mapping.
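A sketch of the dispatch described above (the function name and parameters are illustrative): test collections.abc.Mapping before collections.abc.Sequence, and exclude str, which is itself a sequence.

```python
from collections.abc import Mapping, Sequence

def normalize_row(value, column_names):
    """Convert `value` to a list of cells ordered like `column_names`.

    Illustrative sketch of the proposed dispatch, not the actual
    d2txt API.
    """
    if isinstance(value, Mapping):
        # Mapping: look each column up by name, defaulting to an empty cell.
        return [value.get(name, "") for name in column_names]
    if isinstance(value, Sequence) and not isinstance(value, str):
        # Sequence: cells are assumed to already be in column order.
        return list(value)
    raise TypeError(f"Expected a sequence or mapping, got {type(value).__name__}")
```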

References

Move all tool config into pyproject.toml

The goal is to reduce the number of config files as much as possible.

Group related keys and values in TOML

Many columns in TXT files are closely related to each other and often used together. When decompiling a TXT file to TOML, grouping these columns into a single key would improve readability and reduce both the number of lines and the overall file size. These columns shall be grouped when decompiling to TOML, and un-grouped when compiled back to TXT.

Each column group has a unique group alias. When a D2TXT is decompiled, columns that belong to a column group are not directly written to the TOML file. Instead, their values are combined into a single line, using the group alias (which does not overlap with existing keys in the TXT file) as the key.

The details of column grouping rules, syntax, and the (mostly) full list of columns grouped have been moved to the wiki.

Change [Columns] section format

The Columns section of INI files generated by d2ini.py is used to specify the order of columns in TXT files. Currently, it merely lists each column name in order with empty values:

[Columns]
column name 1=
column name 2=
...

This is ugly. It also depends on ConfigParser preserving the order of keys (see Customizing Parser Behaviour section of ConfigParser docs for more info).

Using column numbers as keys and names as values is more intuitive:

[Columns]
1=column name 1
2=column name 2
...

However, this requires extra code for parsing and sorting the [Columns] section. It also creates an ambiguous situation when a column number is missing:

; column 2 is deleted by user, intentionally or by accident
[Columns]
1=column name 1
3=column name 3
...

If a column number is missing, should D2INI raise an exception, create an empty column, or ignore it?

Enabling the allow_no_value option in ConfigParser would prevent such ambiguities, in addition to eliminating useless equals signs:

[Columns]
column name 1
column name 2
...
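For reference, a short stdlib sketch of the allow_no_value approach. ConfigParser preserves key order within a section, so the column order survives a round trip.

```python
from configparser import ConfigParser

ini_text = """\
[Columns]
column name 1
column name 2
"""

# allow_no_value=True permits keys without '=', as proposed above.
# Note: the default optionxform lowercases keys, which is harmless for
# already-lowercase column names.
parser = ConfigParser(allow_no_value=True)
parser.read_string(ini_text)

# Key order within the section is preserved.
columns = list(parser["Columns"])
```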

TypeError: to_txt() missing 1 required positional argument: 'txtfile'

Hi Pastel, I've written a simple test script that edits the skills file from my D2R install:

skills = D2TXT.load_txt('skills.txt')

#Plauge poppy max spawn
skills[222]['petmax'] = "1"

D2TXT.to_txt('result/skills.txt')

but when I run it, I get a TypeError on every method I call from your class.

Is this an issue only on my end?

TOML vs YAML

Continuing the discussion from #3 (comment), which format should we switch to?

Factors to Consider

  1. Less room for mistakes
  2. Comments
  3. Support for lists/arrays: would make #3 easy to implement
  4. If possible, allow adding/moving/deleting rows without affecting other rows.
    • Currently, decompiled INI files use their row indices as column names. This means that when rows are added, moved, or deleted, many columns have to be renamed, even though their contents haven't changed.
    • This can be worked around by selecting a "primary key" column for each TXT file as the column name. However, it would require deciphering the type of each TXT file. Also, some of the base TXT files have erroneous duplicate IDs (e.g. MonStats.txt), which prevent us from saving them to INI.

Comparison of Markup Formats

JSON

JSON does not support comments. This alone makes it undesirable.

TOML

TOML is very similar to INI files. It supports inline arrays. It also supports arrays of tables, which solves item 4.

TOML requires all string values to be quoted. This makes the output slightly more verbose, but also leaves less room for mistakes.

YAML

YAML supports inline sequences (flow style). However, PyYAML follows special rules to decide whether to emit block or flow style. I don't know if I can (or need to) modify this rule.

Item 4 can be solved by storing each row in a sequence.

YAML is better than TOML for deeply nested structures. However, Diablo 2's TXT files (tab-separated values) are mostly flat, which means YAML's advantage is moot.

YAML also supports unquoted string literals.

YAML has several gotchas. Notably, it has multiple confusing variants of multi-line string syntax.

Other Formats

I know little about other markup formats. I don't want to use an obscure format.

Add support for slice syntax in D2TXT

As of writing, D2TXT.__setitem__() does not support the slice syntax:

# This will raise an exception or corrupt your table!
d2txt[1:1] = [{'column 1': 'value 1', 'column 2': 'value 2'}]

This is because it does not check whether the given key is actually a slice object. For more information, see the docs on the slice() function, as well as a section on slice objects in the Data model docs.

To add support for slice syntax, a type-check for the slice object is necessary, plus some code for inserting them:

# NOT TESTED
def __setitem__(self, key, value):
    if isinstance(key, slice):
        self._rows[key] = [D2TXTRow(self, row) for row in value]
    else:
        self._rows[key] = D2TXTRow(self, value)

Note that D2TXT.__getitem__() and D2TXT.__delitem__() do (unintentionally) support the slice syntax, because they delegate the key (which is either a number or a slice object) to the internal list of rows:

def __getitem__(self, key):
    return self._rows[key]

Of course, this returns a list of rows still bound to the original D2TXT object. Should it return a D2TXT object instead? Maybe.

Shallow Copy vs Deep Copy vs Just Use a List

A shallow copy on slice would return a table whose rows are still bound to the original D2TXT. This may be confusing, since it creates two seemingly distinct table objects that actually refer to the same table. Also, column operations would be tricky to implement. A column operation on one table must propagate to all rows on every other table.

A deep copy on slice would return a clone of the table. Since column operations on one table does not affect the other, there are no headaches. However, since built-in collections return a shallow copy on slice, cloning D2TXT on slice might feel inconsistent.

Just returning a list (current behavior) solves most of the headaches. You don't get a new table, or a "view" of the old table; just a list of rows still linked to the original table. Practicality beats consistency.

Compile from/decompile to TOML instead of INI

Based on the findings in #9, add support for conversion to and from TOML. Also drop support for INI files, which rely on fragile, home-grown syntax.

TOML is better than INI in many aspects. It is more strict and leaves less room for mistakes. It brings its own string escaping rules, so that I don't have to make my own. It supports complex structures like nested lists, so that I don't have to reinvent obscure syntaxes (see #3).

I shall use toml and qtoml. Both are small packages (~90 KiB each, not counting upstream dependencies), so pulling both as dependencies sounds OK.

Inherit D2TXTRow from collections.abc.Mapping

D2TXTRow should inherit from collections.abc.Mapping, not collections.abc.Sequence. The initial decision was affected by the fact that each row is internally stored as a list. However, each cell is usually accessed using column name rather than column index. Thus, it is more intuitive to treat D2TXTRow as a mapping.

D2TXTRow does not inherit from collections.abc.MutableMapping yet, because I do not intend to add support for adding/deleting entire columns in the near future. Even if column operations are added later, they are expensive, and should be exposed through methods on D2TXT. D2TXTRow must separately implement __setitem__() that only allows direct assignments on existing keys (column names).

Previously, D2TXTRow.__getitem__() and D2TXTRow.__setitem__() accepted both column names (str) and indices (int) as keys. However, this required type()-checking the key. It could also create ambiguous situations when a column uses a number as its name ("1", "2", etc.). To prevent such ambiguities, these methods should no longer accept column indices as keys.

Note that ordered iterations are still possible using D2TXTRow.values() and D2TXTRow.items(). Due to how collections.abc.Mapping is implemented (source code), these operations require a key lookup. Thus, each sequential access has a time cost of O(1) average, O(N) amortized worst case (compared to always O(1) when inheriting from collections.abc.Sequence).
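A minimal sketch of such a Mapping-backed row (the `Row` class and its constructor are illustrative, not the real D2TXTRow): cells live in a list ordered by column, a shared column-name-to-index dict (the "master dict") gives O(1) lookup by name, and __setitem__() only allows assignment to existing columns.

```python
from collections.abc import Mapping

class Row(Mapping):
    """Illustrative Mapping-backed row, not the real D2TXTRow."""

    def __init__(self, column_indices, cells):
        self._column_indices = column_indices  # shared "master dict"
        self._cells = list(cells)

    def __getitem__(self, name):
        return self._cells[self._column_indices[name]]

    def __setitem__(self, name, value):
        # Only existing columns may be assigned. Adding a column is a
        # table-level operation, so an unknown name raises KeyError.
        self._cells[self._column_indices[name]] = value

    def __iter__(self):
        return iter(self._column_indices)

    def __len__(self):
        return len(self._column_indices)
```

The Mapping mixin supplies keys(), values(), items(), and friends for free, which is where the per-key lookup cost described above comes from.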

Support adding new rows

As of writing, D2TXT does not provide explicit support for adding new rows. In particular, D2TXT.__setitem__() accepts a list (same as internal representation), despite D2TXTRow behaving like a mapping for practical purposes.

D2TXT should accept most insert operations supported by list, and accept dict objects when inserting new rows:

d2txt.insert({}, i)
d2txt.insert(row_dict, j)
d2txt.append({})
d2txt.append(row_dict)

If row_dict contains a key that does not match any column names in the D2TXT object, raise a KeyError.
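A toy sketch of the proposed behavior (the class and method bodies are illustrative; the (value, index) argument order mirrors the snippet above, though a real API might follow list.insert(index, value) instead):

```python
class Table:
    """Toy table illustrating the proposed insert/append behavior."""

    def __init__(self, column_names):
        self._columns = column_names
        self._rows = []

    def _validate(self, row_dict):
        # Keys that match no column name raise KeyError, as specified above.
        unknown = set(row_dict) - set(self._columns)
        if unknown:
            raise KeyError(f"Unknown column(s): {sorted(unknown)}")
        # Missing columns default to empty cells.
        return {name: row_dict.get(name, "") for name in self._columns}

    def insert(self, row_dict, index):
        self._rows.insert(index, self._validate(row_dict))

    def append(self, row_dict):
        self._rows.append(self._validate(row_dict))
```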

Revert to a single-module package

Many small packages are contained in a single module (file). d2txt.py and d2ini.py are small, so combining them into a single module may be ok.

For more information, see the Python Modules section of the An Overview of Packaging for Python from the Python Packaging User Guide.

Pros

  • Other mod makers (most likely unfamiliar with Python) can just download and use d2txt.py from GitHub

Cons

  • File may become too bulky
  • Not scalable (there are long-term plans to add support for AnimData.d2)

Decision

Use d2txt.py and d2ini.py for the moment. If I ever add support for other file formats, I can always add a new script.

Implement Column Group Tables

Introduction

Column Groups were proposed in #3 and implemented in #14. In doing so, I discovered that several types of columns are not suitable for describing with TOML arrays.

For example, several item-related TXT files have multiple fields with names like modXcode, modXparam, modXmin, modXmax, where X is a positive integer. Such fields are clearly meant to be edited as one subgroup, so it is desirable to place each on a single line. Using TOML arrays, they can be grouped like this:

[[items]]
id = 1
--mod1 = ['damage', '100', '200']
--mod2 = ['fire-damage', '50', '60']
--mod3 = ['swing', '20', '20']

Note that the member columns have heterogeneous types--modXcode is a string, while other fields are numbers. Since arrays in TOML v0.5.0 spec disallow heterogeneous types, integer fields must be encoded as strings, or use the "nested array trick":

--mod1 = [['damage'], [100], [200]]

This is ugly.

Since each value in this group has a different meaning, placing them in a single array--a data structure meant to store multiple entries of the same type--is fundamentally awkward.

In addition, one has to memorize the order of each value within the group. I attempted to solve this by providing descriptive column group aliases:

--mod1-CodeMinMax = [['damage'], [100], [200]]

But this is just as ugly and hard to decipher.

Inline Tables

This type of data is suited for nested dictionaries. In TOML, such data structures can be described compactly using inline tables:

[[items]]
id = 1
mod1 = { code = 'damage', min = 100, max = 200 }
mod2 = { code = 'fire-damage', min = 50,  max = 60 }
mod3 = { code = 'swing', min = 20, max = 20 }

There are two other possible formats, both more verbose than inline tables. Dotted child tables:

[[items]]
id = 1

[items.mod1]
code = 'damage'
min = 100
max = 200

[items.mod2]
code = 'fire-damage'
min = 50
max = 60

[items.mod3]
code = 'swing'
min = 20
max = 20

...and dotted keys:

[[items]]
id = 1
mod1.code = 'damage'
mod1.min = 100
mod1.max = 200
mod2.code = 'fire-damage'
mod2.min = 50
mod2.max = 60
mod3.code = 'swing'
mod3.min = 20
mod3.max = 20

These forms are more verbose than tables without column groups. Inline tables are clearly better for the job.

Design

Column Group Tables

Keys for column group tables are prefixed with two underscores (__). This keeps them visually distinct from column group arrays, which are prefixed with two dashes (--).

Example:

__rArm = { left = 5, right = 10, top = 20, bottom = 25 }

Since each subkey describes the purpose of each value, the key (alias) should be short. Less than 12 characters is good.

Column group tables should ideally have between 2 and 6 values. Each value should have a distinct meaning (cf. etype1, etype2, etype3, ...).

Column Group Metadata

The column_groups table at the top of each TOML file describes the column groups used in the file. Each key-value pair describes a column group array: the key is the column group alias, and the value is an array of member columns.

This format can be extended to describe column group tables as well:

[column_groups]
# Metadata for column group arrays
--Mod1-MinMax = ['mod1code', 'mod1min', 'mod1max']
--Mod2-MinMax = ['mod2code', 'mod2min', 'mod2max']
--Mod3-MinMax = ['mod3code', 'mod3min', 'mod3max']
# Metadata for column group tables
__Mod1 = { code = 'mod1code', min = 'mod1min', max = 'mod1max' }
__Mod2 = { code = 'mod2code', min = 'mod2min', max = 'mod2max' }
__Mod3 = { code = 'mod3code', min = 'mod3min', max = 'mod3max' }

Subkeys such as code, min, and max are henceforth referred to as column member aliases, or member aliases for short.

The metadata maps member aliases to member column names to keep consistent with how the inline tables are used in rows. It also allows us to parse TOML slightly faster, since this metadata can be converted to mappings of member aliases to column names.
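As an illustration of that conversion (the function and variable names here are hypothetical), a decompiled row could be expanded back to flat TXT columns by mapping each member alias through the metadata:

```python
def ungroup_row(row, group_metadata):
    """Expand column-group tables back into flat TXT columns.

    `group_metadata` maps a group alias to {member_alias: column_name},
    mirroring the metadata format above. Names are illustrative.
    """
    flat = {}
    for key, value in row.items():
        if key in group_metadata and isinstance(value, dict):
            # Translate each member alias back to its real column name.
            members = group_metadata[key]
            for member_alias, cell in value.items():
                flat[members[member_alias]] = cell
        else:
            flat[key] = value
    return flat
```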

Implementation

Both toml and qtoml can parse inline tables. Unfortunately, neither supports generating inline tables via public APIs.

I skimmed the source code of toml v0.10.0 and qtoml 0.3.0. Both packages do support generating inline tables from dictionaries, though not easily.

Hacking toml

toml generates an inline table if a toml.TomlEncoder initialized with preserve=True is used and the object is an instance of toml.decoder.InlineTableDict. This class is a direct subclass of object and therefore cannot be used as a mapping. However, toml.TomlDecoder provides the get_empty_inline_table() method, which returns an InlineTableDict instance that also subclasses dict. Easy peasy. Too bad we don't use toml.dumps()--it's bugged.

Hacking qtoml

qtoml generates an inline table for an object if qtoml.encoder.TOMLEncoder.is_scalar() returns True. Normally, this only occurs if a dictionary is inside a list inside another list. I tried to circumvent this by creating a custom dict subclass, which I added to qtoml.TOMLEncoder.st as the key with TOMLEncoder.dump_itable as the value.

from qtoml.encoder import TOMLEncoder

class MyInlineDict(dict):
    pass

hacked_encoder = TOMLEncoder()
hacked_encoder.st[MyInlineDict] = hacked_encoder.dump_itable

This made the encoder generate inline tables for instances of my custom class. Unfortunately, it was also generating nested tables for the same data:

[[items]]
mod1 = { code = 'damage', min = 100, max = 200 }
mod2 = { code = 'damage', min = 150, max = 250 }

[items.mod1]
code = 'damage'
min = 100
max = 200

[items.mod2]
code = 'damage'
min = 150
max = 250

This is because the current implementation of TOMLEncoder.dump_sections() always renders dict instances as individual sections. Thus, we need a custom mapping class that is not a subclass of dict.

Fortunately, the standard library already provides such a class: collections.UserDict. It acts just like a dict, except that isinstance(o, dict) returns False.

Final solution:

from qtoml.encoder import TOMLEncoder
from collections import UserDict

hacked_encoder = TOMLEncoder()
hacked_encoder.st[UserDict] = hacked_encoder.dump_itable

# Use hacked_encoder.dump_sections(o, [], True) to dump TOML

Use Python warnings when de-duplicating column names

Python has explicit library support for warnings. Let's use them where appropriate.

Potential uses of warnings:

  • When a duplicate column name is automatically renamed
  • When an invalid command line argument or option is encountered and argparse fails to deal with it
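A sketch of the first case using the stdlib warnings module (the renaming scheme shown here is illustrative, not the one d2txt uses):

```python
import warnings

def deduplicate_columns(names):
    """Rename duplicate column names, warning about each rename.

    Emits a UserWarning so callers can silence or escalate renames via
    the standard warnings filters.
    """
    seen = {}
    result = []
    for name in names:
        if name in seen:
            seen[name] += 1
            renamed = f"{name}({seen[name]})"
            warnings.warn(f"Duplicate column {name!r} renamed to {renamed!r}")
            result.append(renamed)
        else:
            seen[name] = 0
            result.append(name)
    return result
```

Callers who want strict behavior can turn the warning into an error with warnings.simplefilter("error").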

Installation instructions unclear; how to make sure shell can find d2txt?

I followed the installation instructions:

  1. Install Python 3.9.
  2. Run pip install d2txt in PowerShell.
  3. Restart PowerShell

The command d2txt is not available. PowerShell and CMD don't recognize it as a command. (Neither does the Python interactive environment.)

Do I need further steps to add d2txt to my path?

Thanks in advance! :)
