viperior / rimhistory

RimWorld game save data analyzer

License: MIT License

Python 100.00%
data-engineering elt xml pandas data-visualization data-analysis data-science data-profiling simulation-data rimworld

rimhistory's Introduction

rimhistory

RimWorld game save data analyzer

Example visualization

Example line chart visualizing flora population by species over time

rimhistory's People

Contributors

dependabot[bot], viperior

rimhistory's Issues

Asynchronous File Loading

Use multiprocessing to parallelize loading the XML data into Save objects and converting the extracted subsets into pandas DataFrames. This should greatly improve load times when working with a large number of source files, and it scales automatically with the machine's available CPU cores.

Memory usage is kept in check by ensuring excess XML data is deleted at the end of Save.__init__().
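
A minimal sketch of the approach, assuming Save objects are picklable once the excess XML data has been deleted and that the import path matches the examples below; the helper names are hypothetical:

from multiprocessing import Pool

from rimhistory.save import Save  # import path assumed


def load_save(path: str) -> Save:
    """Worker function: load a single save file into a Save object."""
    return Save(path_to_save_file=path)


def load_saves_parallel(paths: list) -> list:
    """Load many save files in parallel; Pool() sizes itself to the CPU count."""
    with Pool() as pool:
        return pool.map(load_save, paths)

On platforms that use the spawn start method, load_saves_parallel() should be called from inside an if __name__ == "__main__": block.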

Duplicate Pawn Data

After adding support for loading and aggregating data from multiple save files, I discovered what appears to be a data duplication bug for pawn data. I am using a new save file series with more playtime and event history than the original test file. The number of pawns being reported is 26, which suggests possible duplication. The highest number reported from a single save in the series should be around 5 or 6.

Full error details from pytest output:

Run python -m pytest -v -x -n auto
  python -m pytest -v -x -n auto
  shell: /usr/bin/bash -e {0}
  env:
    pythonLocation: /opt/hostedtoolcache/Python/3.10.2/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.10.2/x64/lib
============================= test session starts ==============================
platform linux -- Python 3.10.2, pytest-7.0.1, pluggy-1.0.0
rootdir: /home/runner/work/rimhistory/rimhistory, configfile: pytest.ini, testpaths: tests
plugins: xdist-2.5.0, forked-1.4.0
gw0 I / gw1 I
gw0 [14] / gw1 [14]

.........F
=================================== FAILURES ===================================
_____________________________ test_get_pawn_count ______________________________
[gw1] linux -- Python 3.10.2 /opt/hostedtoolcache/Python/3.10.2/x64/bin/python

test_data_list = ['test_data/demosave 1.rws.gz', 'test_data/demosave 3.rws.gz', 'test_data/demosave 2.rws.gz']

    def test_get_pawn_count(test_data_list: list) -> None:
        """Test counting the number of pawns identified from the save data
    
        Parameters:
        test_data_list (list): The list of paths to the test input data files (fixture)
    
        Returns:
        None
        """
        pawn_data = Save(path_to_save_file=test_data_list[0]).data.dataset.pawn.dictionary_list
    
>       assert len(pawn_data) == 3
E       AssertionError: assert 26 == 3
E        +  where 26 = len([{'pawn_ambient_temperature': '28.46966', 'pawn_biological_age': '4', 'pawn_chronological_age': '4', 'pawn_id': 'Thing...8601', 'pawn_biological_age': '100', 'pawn_chronological_age': '100', 'pawn_id': 'Thing_Android2Tier288152', ...}, ...])

tests/test_pawn_data.py:17: AssertionError
=========================== short test summary info ============================
FAILED tests/test_pawn_data.py::test_get_pawn_count - AssertionError: assert ...
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!!
========================= 1 failed, 9 passed in 18.55s =========================
Error: Process completed with exit code 2.

Branch: feature/async-load
Commit tested: b337a17

I suspect pawn data appears in more than one place in the save file. I need to identify the single source of truth for colonist-type pawns and formulate the correct XPath patterns to target only those elements. I may also need to select the XML element representing the current state of the pawn rather than a past one.
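
A hedged sketch of the direction: constrain the XPath to a single canonical location and de-duplicate on pawn ID. The XPath, element names, and record keys below are placeholders that need to be confirmed against the actual save file structure, and the project may use lxml rather than ElementTree:

import xml.etree.ElementTree as ET


def extract_colonist_pawn_elements(root: ET.Element) -> list:
    """Return pawn elements from one canonical location in the save XML."""
    # Placeholder XPath; the real path and Class attribute value are assumptions
    return root.findall(".//maps/li/things/thing[@Class='Pawn']")


def dedupe_pawn_records(pawn_dictionary_list: list) -> list:
    """Drop records that repeat a pawn_id, keeping the first occurrence."""
    seen_ids = set()
    unique_records = []
    for record in pawn_dictionary_list:
        if record["pawn_id"] not in seen_ids:
            seen_ids.add(record["pawn_id"])
            unique_records.append(record)
    return unique_records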

Simplify Save Property Namespace

Simplify the property names used in the Save class by completing #21 and removing the then-obsolete dataset level in the data Bunch object. Store the named datasets at the same level as the non-dataset properties.

from save import Save

save = Save(path_to_save_file="saves/save 1.rws")

# Before change
plant_data = save.data.dataset.plant

# After change
plant_data = save.data.plant
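
For illustration only, assuming the data Bunch supports keyword construction and attribute access like sklearn.utils.Bunch; the project's Bunch and the field names here may differ:

import pandas as pd
from sklearn.utils import Bunch  # assumed Bunch implementation

plant_rows = [{"plant_id": "Thing_PlantGrass12345", "plant_growth": "0.75"}]
pawn_rows = [{"pawn_id": "Thing_Human98765", "pawn_biological_age": "34"}]

# After the change, named datasets sit directly on the data Bunch,
# alongside non-dataset properties such as the file name.
data = Bunch(
    file_name="saves/save 1.rws",
    plant=pd.DataFrame(plant_rows),
    pawn=pd.DataFrame(pawn_rows),
)

assert data.plant.equals(data["plant"])  # attribute and key access both work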

Multi-save time series analysis

A RimWorld save file can be thought of as a snapshot of the game state. To conduct time series analysis, multiple snapshots are needed. Now that the ELT design for extracting some of the key datasets is complete, it is time to implement a way to combine data from a series of save files into a single dataset. A time dimension will distinguish the rows contributed by each save in the unioned dataset.

For example, the app can currently load plant data from one save at a time:

import statistics

from rimhistory.save import Save

save_1 = Save("saves/mysave 1.rws")
save_2 = Save("saves/mysave 2.rws")
save_3 = Save("saves/mysave 3.rws")
total_plants_1 = len(save_1.data.dataset.plant.dictionary_list)
total_plants_2 = len(save_2.data.dataset.plant.dictionary_list)
total_plants_3 = len(save_3.data.dataset.plant.dictionary_list)
average = statistics.mean([total_plants_1, total_plants_2, total_plants_3])

print(f"Average living plants over time = {average}")

The SaveSeries class will streamline these operations:

from rimhistory.save import SaveSeries

series = SaveSeries(
    save_dir_path="path/to/saves",
    save_file_regex_pattern=r"mysave\s\d{1,10}"
)
average = len(series.dataset.plant.dataframe.index) / len(series.dictionary)

print(f"Average living plants over time = {average}")

Load Saves from S3

Add a feature that allows rimhistory to load RimWorld save files hosted in an S3 bucket.

New options in config.json:

  • rimworld_save_file_source (str): The source to use to load the save files (local, s3)
  • rimworld_save_file_s3_bucket (str): The name of the S3 bucket to load save files from

The current implementation uses glob and a regex to scan a local directory for save files. It relies on file naming conventions rather than sniffing the XML before loading a file into a series (sniffing would be a nice enhancement), and it ignores autosaves. rimhistory works best when paired with a mod that uses templates to name autosaves; I found two or three such mods on the Steam Workshop.

To support S3 loading, new functionality must be added to match the naming pattern against the object names in the bucket. Matching objects are then downloaded, and their content is loaded as a string into Save objects via SaveSeries.
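
A sketch of the listing and download steps using boto3; the helper names are hypothetical, and the bucket name would come from the proposed rimworld_save_file_s3_bucket option:

import re

import boto3  # assumed new dependency for the S3 source


def list_matching_save_keys(bucket: str, key_regex_pattern: str) -> list:
    """Return the keys of objects in the bucket whose names match the pattern."""
    pattern = re.compile(key_regex_pattern)
    s3_client = boto3.client("s3")
    matching_keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if pattern.search(obj["Key"]):
                matching_keys.append(obj["Key"])
    return matching_keys


def read_save_content(bucket: str, key: str) -> str:
    """Download one save file's XML content as a string."""
    s3_client = boto3.client("s3")
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read().decode("utf-8")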

Minimize Memory Usage

The Save class is storing the datasets as:

  • a list of dictionaries
  • a pandas DataFrame

Reduce memory usage by dropping the list of dictionaries as soon as the pandas DataFrame is created from it. This also allows the Bunch path to be simplified from save.data.dataset.plant.dataframe to save.data.dataset.plant. A list of dictionaries can be re-created as needed, but most operations can access the pandas DataFrame directly.

class Save:
    def __init__(self, path_to_save_file: str) -> None:
        # Extract datasets
        # Delete the root object to free up memory
        # Generate pandas DataFrames from each dataset initialized as a list of dictionaries
        # <-- Add a step here that deletes each list of dictionaries
        # Apply transformations to DataFrames
        ...
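
For illustration, a hedged snippet of the new step with made-up dataset contents; inside Save.__init__() the list would come from the XML extraction:

import pandas as pd

plant_dictionary_list = [
    {"plant_id": "Thing_PlantGrass12345", "plant_growth": "0.75"},
    {"plant_id": "Thing_PlantTreeOak67890", "plant_growth": "1.00"},
]
plant_dataframe = pd.DataFrame(plant_dictionary_list)

# New step: drop the intermediate list as soon as the DataFrame exists
del plant_dictionary_list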
