viperior / rimhistory

RimWorld game save data analyzer

License: MIT License

Python 100.00%
data-engineering elt xml pandas data-visualization data-analysis data-science data-profiling simulation-data rimworld

rimhistory's Introduction

rimhistory

RimWorld game save data analyzer

Example visualization

Example line chart visualizing flora population by species over time

rimhistory's People

Contributors

dependabot[bot], viperior

rimhistory's Issues

Asynchronous File Loading

Use multiprocessing to parallelize loading the XML data into Save objects and converting the extracted subsets into pandas DataFrames. This should greatly improve load times when working with a large number of source files, and it scales automatically with the machine's available CPU cores.

Memory usage is kept in check by ensuring excess XML data is deleted at the end of Save.__init__().
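
A minimal sketch of the approach, assuming Save objects are picklable once the excess XML data has been deleted and that the import path matches the examples below; the helper names are hypothetical:

from multiprocessing import Pool

from rimhistory.save import Save  # import path assumed


def load_save(path: str) -> Save:
    """Worker function: load a single save file into a Save object."""
    return Save(path_to_save_file=path)


def load_saves_parallel(paths: list) -> list:
    """Load many save files in parallel; Pool() sizes itself to the CPU count."""
    with Pool() as pool:
        return pool.map(load_save, paths)

On platforms that use the spawn start method, load_saves_parallel() should be called from inside an if __name__ == "__main__": block.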

Duplicate Pawn Data

After adding support for loading and aggregating data from multiple save files, I discovered what appears to be a data duplication bug for pawn data. I am using a new save file series with more playtime and event history than the original test file. The number of pawns being reported is 26, which suggests possible duplication. The highest number reported from a single save in the series should be around 5 or 6.

Full error details from pytest output:

Run python -m pytest -v -x -n auto
  python -m pytest -v -x -n auto
  shell: /usr/bin/bash -e {0}
  env:
    pythonLocation: /opt/hostedtoolcache/Python/3.10.2/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.10.2/x64/lib
============================= test session starts ==============================
platform linux -- Python 3.10.2, pytest-7.0.1, pluggy-1.0.0
rootdir: /home/runner/work/rimhistory/rimhistory, configfile: pytest.ini, testpaths: tests
plugins: xdist-2.5.0, forked-1.4.0
gw0 I / gw1 I
gw0 [14] / gw1 [14]

.........F
=================================== FAILURES ===================================
_____________________________ test_get_pawn_count ______________________________
[gw1] linux -- Python 3.10.2 /opt/hostedtoolcache/Python/3.10.2/x64/bin/python

test_data_list = ['test_data/demosave 1.rws.gz', 'test_data/demosave 3.rws.gz', 'test_data/demosave 2.rws.gz']

    def test_get_pawn_count(test_data_list: list) -> None:
        """Test counting the number of pawns identified from the save data
    
        Parameters:
        test_data_list (list): The list of paths to the test input data files (fixture)
    
        Returns:
        None
        """
        pawn_data = Save(path_to_save_file=test_data_list[0]).data.dataset.pawn.dictionary_list
    
>       assert len(pawn_data) == 3
E       AssertionError: assert 26 == 3
E        +  where 26 = len([{'pawn_ambient_temperature': '28.46966', 'pawn_biological_age': '4', 'pawn_chronological_age': '4', 'pawn_id': 'Thing...8601', 'pawn_biological_age': '100', 'pawn_chronological_age': '100', 'pawn_id': 'Thing_Android2Tier288152', ...}, ...])

tests/test_pawn_data.py:17: AssertionError
=========================== short test summary info ============================
FAILED tests/test_pawn_data.py::test_get_pawn_count - AssertionError: assert ...
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!
========================= 1 failed, 9 passed in 18.55s =========================
Error: Process completed with exit code 2.

Branch: feature/async-load
Commit tested: b337a17

I suspect pawn data appears in more than one place in the save file. I need to identify the single source of truth for colonist-type pawns and formulate the correct XPath patterns to target only those elements. I may also need to select the XML element representing the current state of the pawn rather than a past one.
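
A hedged sketch of the direction: constrain the XPath to a single canonical location and de-duplicate on pawn ID. The XPath, element names, and record keys below are placeholders that need to be confirmed against the actual save file structure, and the project may use lxml rather than ElementTree:

import xml.etree.ElementTree as ET


def extract_colonist_pawn_elements(root: ET.Element) -> list:
    """Return pawn elements from one canonical location in the save XML."""
    # Placeholder XPath; the real path and Class attribute value are assumptions
    return root.findall(".//maps/li/things/thing[@Class='Pawn']")


def dedupe_pawn_records(pawn_dictionary_list: list) -> list:
    """Drop records that repeat a pawn_id, keeping the first occurrence."""
    seen_ids = set()
    unique_records = []
    for record in pawn_dictionary_list:
        if record["pawn_id"] not in seen_ids:
            seen_ids.add(record["pawn_id"])
            unique_records.append(record)
    return unique_records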

Simplify Save Property Namespace

Simplify the property names used in the Save class by completing #21 and removing the then-obsolete dataset level in the data Bunch object. Store the named datasets at the same level as the non-dataset properties.

from save import Save

save = Save(path_to_save_file="saves/save 1.rws")

# Before change
plant_data = save.data.dataset.plant

# After change
plant_data = save.data.plant
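
For illustration only, assuming the data Bunch supports keyword construction and attribute access like sklearn.utils.Bunch; the project's Bunch and the field names here may differ:

import pandas as pd
from sklearn.utils import Bunch  # assumed Bunch implementation

plant_rows = [{"plant_id": "Thing_PlantGrass12345", "plant_growth": "0.75"}]
pawn_rows = [{"pawn_id": "Thing_Human98765", "pawn_biological_age": "34"}]

# After the change, named datasets sit directly on the data Bunch,
# alongside non-dataset properties such as the file name.
data = Bunch(
    file_name="saves/save 1.rws",
    plant=pd.DataFrame(plant_rows),
    pawn=pd.DataFrame(pawn_rows),
)

assert data.plant.equals(data["plant"])  # attribute and key access both work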

Multi-save time series analysis

A RimWorld save file can be thought of as a snapshot of the game state. To conduct time series analysis, multiple snapshots are needed. Now that the ELT design for extracting some of the key datasets is complete, it is time to implement a way to combine data from a series of save files into a single dataset. A time dimension will distinguish the rows contributed by each save in the unioned dataset.

For example, the app can currently load plant data from one save at a time:

import statistics

from rimhistory.save import Save

save_1 = Save("saves/mysave 1.rws")
save_2 = Save("saves/mysave 2.rws")
save_3 = Save("saves/mysave 3.rws")
total_plants_1 = len(save_1.data.dataset.plant.dictionary_list)
total_plants_2 = len(save_2.data.dataset.plant.dictionary_list)
total_plants_3 = len(save_3.data.dataset.plant.dictionary_list)
average = statistics.mean([total_plants_1, total_plants_2, total_plants_3])

print(f"Average living plants over time = {average}")

The SaveSeries class will streamline these operations:

from rimhistory.save import SaveSeries

series = SaveSeries(
    save_dir_path="path/to/saves",
    save_file_regex_pattern=r"mysave\s\d{1,10}"
)
average = len(series.dataset.plant.dataframe.index) / len(series.dictionary)

print(f"Average living plants over time = {average}")

Load Saves from S3

Add a feature that allows rimhistory to load RimWorld save files hosted in an S3 bucket.

New options in config.json:

  • rimworld_save_file_source (str): The source to use to load the save files (local, s3)
  • rimworld_save_file_s3_bucket (str): The name of the S3 bucket to load save files from

The current implementation uses glob and a regex to scan a local directory for save files. It relies on file naming conventions rather than sniffing the XML before loading a file into a series (sniffing would be a nice enhancement), and it ignores autosaves. rimhistory works best when paired with a mod that uses templates to name autosaves; I found two or three such mods on the Steam Workshop.

To support S3 loading, new functionality must be added to match the naming pattern against the object names in the bucket. Matching objects are then downloaded, and their content is loaded as a string into Save objects via SaveSeries.
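
A sketch of the listing and download steps using boto3; the helper names are hypothetical, and the bucket name would come from the proposed rimworld_save_file_s3_bucket option:

import re

import boto3  # assumed new dependency for the S3 source


def list_matching_save_keys(bucket: str, key_regex_pattern: str) -> list:
    """Return the keys of objects in the bucket whose names match the pattern."""
    pattern = re.compile(key_regex_pattern)
    s3_client = boto3.client("s3")
    matching_keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if pattern.search(obj["Key"]):
                matching_keys.append(obj["Key"])
    return matching_keys


def read_save_content(bucket: str, key: str) -> str:
    """Download one save file's XML content as a string."""
    s3_client = boto3.client("s3")
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read().decode("utf-8")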

Minimize Memory Usage

The Save class is storing the datasets as:

  • a list of dictionaries
  • a pandas DataFrame

Reduce memory usage by dropping the list of dictionaries as soon as the pandas DataFrame is created from it. This also allows the Bunch path to be simplified from save.data.dataset.plant.dataframe to save.data.dataset.plant. A list of dictionaries can be re-created as needed, but most operations can access the pandas DataFrame directly.

class Save:
    def __init__(self, path_to_save_file: str) -> None:
        # Extract datasets
        # Delete the root object to free up memory
        # Generate pandas DataFrames from each dataset initialized as a list of dictionaries
        # <-- Add a step here that deletes each list of dictionaries
        # Apply transformations to DataFrames
        ...
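
For illustration, a hedged snippet of the new step with made-up dataset contents; inside Save.__init__() the list would come from the XML extraction:

import pandas as pd

plant_dictionary_list = [
    {"plant_id": "Thing_PlantGrass12345", "plant_growth": "0.75"},
    {"plant_id": "Thing_PlantTreeOak67890", "plant_growth": "1.00"},
]
plant_dataframe = pd.DataFrame(plant_dictionary_list)

# New step: drop the intermediate list as soon as the DataFrame exists
del plant_dictionary_list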
