jocketf / fimfarchive

Preserves stories from Fimfiction

Home Page: https://www.fimfiction.net/user/116950/

License: GNU General Public License v3.0

Python 99.59% HTML 0.29% Dockerfile 0.12%
mylittlepony archiving fanfiction fimfiction pony python

fimfarchive's Introduction

Fimfarchive

Fimfarchive aims to release all stories on Fimfiction as a single ZIP-file. The archive contains not only stories, but also metadata such as tags, ratings, and descriptions. It is organized by author and could be used for backup, offline reading, or data mining.

Releases can be found on Fimfarchive's user profile at Fimfiction. Note that this is not an official Fimfiction project, so do not send questions to Fimfiction staff. Instead, send a private message or post a comment to the Fimfarchive user profile.

A new version will be released each season via BitTorrent, approximately once every three months. When suitable, an xdelta3 patch will also be provided for users who do not wish to redownload unchanged stories.

Note that the archive contains a large number of files. Unzipping it to your file system may not be necessary if the archive is to be used together with some application. If you are a developer, reading directly from the ZIP-file may be preferable.
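As a rough illustration of reading directly from the ZIP-file, the sketch below uses Python's standard zipfile module to summarize the archive's members and to parse a single JSON member without extracting anything to disk. The function names are made up for this example, and the archive's internal layout is not described here, so the member name passed in is left to the caller. Note that parsing a very large JSON member this way loads it fully into memory.

```python
import json
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def member_types(archive_path):
    """Count archive members by file extension without extracting them."""
    with zipfile.ZipFile(archive_path) as archive:
        return Counter(
            PurePosixPath(info.filename).suffix
            for info in archive.infolist()
            if not info.is_dir()
        )


def read_json_member(archive_path, member):
    """Parse one JSON member straight from the archive (hypothetical helper)."""
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member) as stream:
            return json.load(stream)
```

For example, member_types("fimfarchive.zip") would give a quick overview of what a release contains, assuming the release has been saved under that hypothetical file name.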

This repository contains code for updating and building the archive. While the API is not guaranteed to be stable, it can also be used as a library for easy access to stories and metadata within the archive. However, a Fimfiction API key is needed to fetch stories directly from Fimfiction.

Installation

There are primarily two ways to install this tool. The first is installation as a library for use within other projects, and the second is installation for development of Fimfarchive. Using a virtual environment is recommended for both cases in order to avoid contaminating the rest of the Python installation.

Installation as a Library

Make sure a virtual environment has been created and activated. When done, simply install the library directly from the master branch on GitHub.

python3 -m pip install git+https://github.com/JockeTF/fimfarchive.git

Optionally also install lz4 to lower the memory footprint of open archives.

python3 -m pip install lz4

That's it! Import a class to make sure things work as expected.

from fimfarchive.fetchers import FimfarchiveFetcher

Installation for Development

Start by creating a clone of the Fimfarchive repository.

git clone https://github.com/JockeTF/fimfarchive.git

Enter the cloned repository and create a virtual environment called venv within it. Make sure to activate the virtual environment before proceeding to install the development dependencies.

python3 -m pip install -r requirements.txt

Optionally also install lz4 to lower the memory footprint of open archives.

python3 -m pip install lz4

All done! Run the test suite to make sure everything works as expected.

pytest

Running

Fimfarchive has a command line interface which is invoked as a Python module. It can't do much except prepare new Fimfarchive releases. For archive browsing you will need to use third-party tools, or make your own.

$ python3 -m fimfarchive
Usage: COMMAND [PARAMETERS]

Fimfarchive, ensuring that history is preseved.

Commands:
  build   Builds a new Fimfarchive release.
  update  Updates stories for Fimfarchive.

The command line interface features multiple subcommands, each with its own brief help text. The subcommand is specified as the second program argument.

$ python3 -m fimfarchive update --help
usage: [-h] [--alpha] --archive PATH [--refetch]

Updates stories for Fimfarchive.

optional arguments:
  -h, --help      show this help message and exit
  --alpha         fetch from Fimfiction APIv1
  --archive PATH  previous version of the archive
  --refetch       refetch all available stories

Some commands (such as update) require a Fimfiction API key. The program reads this key from the environment variable FIMFICTION_ACCESS_TOKEN. Any data downloaded from Fimfiction is stored in the current working directory, typically in the worktree subdirectory. The same thing goes for rendered stories, built archives, or anything else related to the release process.
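When scripting a release, it can help to fail early if the API key is missing rather than partway through an update. The following is a minimal sketch, not part of Fimfarchive itself: the helper names are hypothetical, and only the FIMFICTION_ACCESS_TOKEN variable and the update flags shown in the help text above are taken from this document.

```python
import os
import sys


def require_token(environ=os.environ):
    """Return the Fimfiction API key, or raise if the variable is unset."""
    token = environ.get("FIMFICTION_ACCESS_TOKEN")
    if not token:
        raise RuntimeError("FIMFICTION_ACCESS_TOKEN is not set")
    return token


def update_command(archive, refetch=False):
    """Build the argument list for the update subcommand."""
    command = [
        sys.executable, "-m", "fimfarchive",
        "update", "--archive", str(archive),
    ]
    if refetch:
        # Corresponds to the --refetch flag from the help text.
        command.append("--refetch")
    return command
```

A wrapper script could call require_token() first and then pass the result of update_command(...) to subprocess.run with check=True, so that the key check happens before any downloading starts.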

Process

The process for building a new Fimfarchive release consists of a few simple steps. Before starting, make sure you have the previous version of Fimfarchive nearby, as well as a Fimfiction APIv2 key. Also, remove any previous worktree directory from the current working directory. Some of the commands mentioned below are currently only available in feature branches.

  • Update: Invoke the update subcommand to refresh all stories. This takes about one month since all story metadata has to be traversed. Story data isn't downloaded unless changes have been made since the last release. Use the --refetch flag if all data should be updated regardless of whether anything has changed. Write down the Started and Done dates for later.

  • Render: Use the render subcommand to generate EPUB-files for all stories with updated content. The subcommand requires ebook-convert from Calibre to be installed and accessible from the command line. Fimfarchive will usually keep the CPU maxed out for a few hours during this step.

  • Count: The count subcommand compares the upcoming release with the previous one. The output mainly consists of statistics for the changelog.

  • Document: Update the documentation in docs/readme.tex for the upcoming release. Change the document title, add a row to the changelog table, and add a new changelog subsection. Render the document a few times with lualatex and place the result in worktree/extras as readme.pdf.

  • About: Create an about.json file in worktree/extras. The file has three keys named version, start, and end. Each key has a simple date string like 20201201 as its value. Preferably use the file included with the previous release as a template to keep things consistent.

  • Build: Create a build directory in worktree, and then run the build subcommand. Expect this to take up to 15 minutes depending on the machine. The resulting archive will be written to the build directory.

  • Verify: Go through the archive to check that everything looks good. One tip is to test the CRC checksums of both the outer ZIP-archive and internal EPUB-files. Sample some old and new stories to check that they look right. Successfully opening the archive with Fimfareader can help prove that the metadata has all of the required fields with the correct data types.

  • Patch: Create an xdelta3 patch if applicable. It's important to allow xdelta3 to use a lot of memory since it otherwise has trouble seeing the similarities between the archives. For example, xdelta3 -B 2147483648 -e -s <old> <new> <patch> uses the maximum allowed value of 2 GiB.

  • Torrent: Create a torrent file if applicable. Using a private tracker with a whitelist is preferable since public ones could be flaky or have poor response times. However, it's usually a good idea to include a few public trackers as well to improve availability. Set the chunk size so that the torrent is split into somewhere between 1000 and 2000 pieces. Values outside that range could cause performance issues or prevent the torrent from being easily distributed via magnet links.

  • Release: Upload, announce, and distribute the release!
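A few of the manual steps above (About, Verify, and Torrent) lend themselves to small helper scripts. The sketch below is not part of Fimfarchive; the function names, the 16 KiB starting piece size, and the assumption that EPUB members end in .epub are illustrative choices. It writes an about.json with the three keys described above, checks the CRC checksums of the outer archive and of each inner EPUB using the standard zipfile module, and picks a power-of-two piece size that yields at most 2000 pieces.

```python
import io
import json
import zipfile
from pathlib import Path


def write_about(extras, version, start, end):
    """Write about.json with date strings such as '20201201'."""
    extras = Path(extras)
    extras.mkdir(parents=True, exist_ok=True)
    path = extras / "about.json"
    about = {"version": version, "start": start, "end": end}
    path.write_text(json.dumps(about, indent=4) + "\n", encoding="utf-8")
    return path


def corrupt_members(archive):
    """List members that fail a CRC check, including inside EPUB files."""
    bad = []
    with zipfile.ZipFile(archive) as outer:
        for name in outer.namelist():
            try:
                data = outer.read(name)  # raises BadZipFile on a CRC mismatch
            except zipfile.BadZipFile:
                bad.append(name)
                continue
            if not name.endswith(".epub"):  # assumed EPUB naming
                continue
            try:
                with zipfile.ZipFile(io.BytesIO(data)) as epub:
                    if epub.testzip() is not None:
                        bad.append(name)
            except zipfile.BadZipFile:
                bad.append(name)  # not a readable ZIP container at all
    return bad


def piece_length(total_bytes, max_pieces=2000):
    """Smallest power-of-two piece length giving at most max_pieces pieces."""
    length = 16 * 1024  # assumed lower bound
    while total_bytes > length * max_pieces:
        length *= 2
    return length
```

Because the piece length doubles until the count drops to 2000 or fewer, the result stays above 1000 pieces whenever the loop has run at least once. For a hypothetical 6 GiB archive, piece_length(6 * 2**30) returns 4 MiB, which corresponds to 1536 pieces.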

fimfarchive's Issues

How to run?

Could you please add some documentation as to how to get this up and running? All the commands from cloning the repo (I already figured that part out :P but I figure it's good to include for completeness) to creating the final zip.

Status visible, old, and deleted fics

Hello,
I noticed that every fic in the index.json has 'status': 'visible', even the ones that have been deleted or that are not accessible publicly.
The submitted and published fields are also always true as far as I can tell.

But the data for some of those stories is pretty inconsistent with the rest of the archive. I think most of those are deleted stories, but I have no way to exclude them since every story is marked visible and published in the archive.
Some of the problems with not being able to exclude deleted stories:

  • Stories on the site all have a 'series' tag (e.g. MLP-FiM or EQG), but there are ~45k fics in the archive that don't have one. Many of those seem to be old deleted stories, but there's no reliable way to know.
    • Note that there are also non-deleted stories that have inconsistent tags! Story 31718 has the MLP:FiM tag on the site, but not in the archive.
  • The non-story data won't be up to date: the author object will be full of NULL values on some stories, but not others (even though the author's account is still active).
  • It makes it harder to use fimfarchive as a data source in general. For example I saw the search GUI that works offline, but if I wanted something like this as a webpage that links to the real site, I'd need a way to filter dead links.

So, is it intended that status, submitted and published are always truthy?
Is there a way to filter out deleted fics that I missed, and is it normal that some of the non-deleted fics' tags don't match what's on the site?

I made a browser-based search for fimfarchive

Hello!

I wanted to let you know that I made a browser-based search for fimfarchive. It's not a full-text search, but it is content-based. It relies on text embeddings produced by the SPLADE v2 neural network.

You can find it at https://a0346f102085fe9f.github.io/IAS2/

It can be hosted locally too.

You can add it to the "third party projects" of the fimfarchive page on fimfiction if you see fit.
