richfromm / slack2discord


This project forked from thomasloupe/slackord1


A Discord client that imports Slack-exported JSON chat history to Discord channel(s).

License: GNU General Public License v3.0

Python 98.32% Shell 1.68%

slack2discord's Introduction

slack2discord

A Discord client that imports Slack-exported JSON chat history to Discord channel(s).

tl;dr

See Script invocation below.

Donations

If you find this software useful and would like to make a financial contribution, you can make a donation via either:

It would be most appreciated, but you are under no obligation to do so.

History

This started out as thomasloupe/Slackord. I made some contributions (see #7 and #9) to add support for threads. But my list of additional proposed changes was significant enough that we mutually decided it would be better for me to continue development on a hard fork.

By now the code has changed enough, and even uses a totally different approach for communicating with Discord, that there's probably very little of the original codebase remaining. Nevertheless, I owe inspiration to the original, and it helped me significantly in getting started on this effort.

Note that there also exists a .NET version thomasloupe/Slackord2 by the original author, that contains additional functionality and appears to be more actively maintained than the upstream fork from which this Python project originates.

Prereqs

py3

This assumes Python 3.x. (Actually, currently 3.9+, see mypy notes below.) Python 2.x was EOL'd at the beginning of 2020, and no new project should be using it.

virtualenv

Install the required packages into a Python virtualenv via:

pip install -r requirements.txt

For help creating virtual environments, see the venv docs. If you use Python a lot, you may also want to consider virtualenvwrapper. If you don't want to think much about virtual envs and just want simple Python scripts to work, you could consider pyv.

Usage

Slack export

To export your Slack data as JSON files, see the article at https://slack.com/help/articles/201658943-Export-your-workspace-data

Note that only workspace owners/admins and org owners/admins can use this feature. While this is available on all plans, only public channels are included if you have the Free or Pro version. You need a Business+ or Enterprise Grid plan to export private channels and direct messages (DMs).

I also think (but am not 100% positive) that the export has the same 90 day limit of history imposed on Free plans as of 1 September 2022. (This change was what motivated me to migrate from Slack to Discord and work on this tool.)

If it is a large export, this might take a while. Slack will notify you when it is done.

The export is in the form of a zip file. Download it, and unzip it.

The contents include dirs at the top level, one for each channel. Within each dir is one or more JSON files, of the form YYYY-MM-DD.json, one for each day in which there are messages for that channel. Like so:

channel1
 |- 2022-01-01.json
 |- 2022-01-02.json
 |- ...
channel2
 |- 2022-01-01.json
 |- 2022-01-02.json
 |- ...
...

There is additionally some metadata contained in JSON files at the top level, but they are not (currently) used by this script, and are not shown above.
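As a rough illustration of the layout described above, the following sketch walks an unzipped export and collects the per-day files for each channel. The function names `list_export_files` and `load_day` are hypothetical and are not taken from the script itself.

```python
import json
from pathlib import Path


def list_export_files(export_root):
    """Map each channel dir name to its day files, sorted by date.

    Assumes the layout shown above: one dir per channel, each holding
    YYYY-MM-DD.json files. Top-level JSON metadata files are skipped
    because only directories are visited.
    """
    root = Path(export_root)
    channels = {}
    for channel_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # ISO dates sort chronologically when sorted as plain strings.
        channels[channel_dir.name] = sorted(channel_dir.glob("*.json"))
    return channels


def load_day(path):
    """Load one day's messages (a JSON list) from a single export file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling `list_export_files("path/to/unzipped/export")` would return a dict keyed by channel name, with each value being that channel's day files in chronological order.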

Discord import

One time setup

For complete instructions, see https://discordpy.readthedocs.io/en/stable/discord.html

Log in to the Discord website and go to the Applications page:

https://discordapp.com/developers/applications/
Create a new application. See the implications below (in creating a bot) of choosing an application name that contains the phrase "discord" (like "slack2discord").

New Application -> Name -> Create
Optionally enter a description. For example:

Applications -> Settings -> General Information -> Description

A Discord client that imports Slack-exported JSON chat history to Discord channel(s).

Save Changes
Create a bot

Applications -> Settings -> Bot -> Build-A-Bot -> Add Bot -> Yes, do it!
Unfortunately, if you named your App "slack2discord", Discord disallows the use of the phrase "discord" within a username. So the Username prefix (before the number) will default to "slack2". If you don't like that, you can choose something else:

Applications -> Settings -> Bot -> Build-A-Bot -> Username

like "slack2disc0rd".
Create a token:

Applications -> Settings -> Bot -> Build-A-Bot -> Token -> Reset Token -> Yes, do it!

Copy the token now, as it will not be shown again. This is used below.

(Don't worry if you mess this up, you can always just repeat this step to create a new token.)
Invite the bot to your Discord server:

Applications -> Settings -> OAuth2 -> URL Generator -> Scopes: check "bot"

Bot permissions -> General permissions:

check: "Manage Channels"

Bot permissions -> Text permissions:

additionally check: "Send Messages", "Create Public Threads", "Send Messages in Threads"

This will create a URL that you can use to add the bot to your server.
- Go to Generated URL
- Copy the URL
- Paste into your browser
- Log in if requested
- Select your Discord server, and authorize the external application to access your Discord account, confirming the above permissions.
- -> Continue -> Authorize
- Do the Captcha if requested
- Close the browser tab

Discord token

The Discord token (created for your bot above) must be specified in one of the following manners. This is the order that is searched:

On the command line with --token TOKEN
Via the DISCORD_TOKEN env var
Via a .discord_token file placed in the same dir as the slack2discord.py script.
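The search order above could be sketched roughly as follows. This is only an illustration, not the script's actual code: the function name `resolve_token`, its parameters, and the choice of `RuntimeError` are all assumptions.

```python
import os
from pathlib import Path


def resolve_token(cli_token=None, env=None, script_dir="."):
    """Return the first token found, following the search order above."""
    env = os.environ if env is None else env
    if cli_token:                                  # 1. --token TOKEN
        return cli_token
    if env.get("DISCORD_TOKEN"):                   # 2. DISCORD_TOKEN env var
        return env["DISCORD_TOKEN"]
    token_file = Path(script_dir) / ".discord_token"
    if token_file.is_file():                       # 3. .discord_token file
        return token_file.read_text(encoding="utf-8").strip()
    # Hypothetical error handling; the real script may report this differently.
    raise RuntimeError("No Discord token found")
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.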

Script invocation

Briefly, the script is executed via:

./slack2discord.py [--token TOKEN] [--server SERVER] [--no-create] \
    [--users-file USERS_FILE] [--downloads-dir DOWNLOADS_DIR] [--ignore-file-not-found] \
    [-v | --verbose] [-n | --dry-run] <src-and-dest-related-options>

The src and dest related options can be specified in one of three different ways:

--src-file SRC_FILE --dest-channel DEST_CHANNEL

This is for importing a single file from a Slack export, that corresponds to a single day of a single channel.
--src-dir SRC_DIR [--dest-channel DEST_CHANNEL]

This is for importing all of the days from a single channel in a Slack export, one file per day.
--src-dirtree SRC_DIRTREE [--channel-file CHANNEL_FILE]

This is for importing all of the days from multiple (potentially all) channels in a Slack export. One dir per channel, and within each channel dir, one file per day.
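To make the three combinations concrete, here is a minimal argparse sketch covering only the source and destination options. It is not the script's real parser. The flag names come from the usage text above, but the validation rules (for example, rejecting --dest-channel together with --src-dirtree) are assumptions inferred from that text.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="slack2discord.py")
    # Exactly one of the three source options must be given.
    src = parser.add_mutually_exclusive_group(required=True)
    src.add_argument("--src-file")
    src.add_argument("--src-dir")
    src.add_argument("--src-dirtree")
    parser.add_argument("--dest-channel")
    parser.add_argument("--channel-file")
    return parser


def parse_src_dest(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed rules, inferred from the usage text above.
    if args.src_file and not args.dest_channel:
        parser.error("--src-file requires --dest-channel")
    if args.src_dirtree and args.dest_channel:
        parser.error("--dest-channel cannot be combined with --src-dirtree")
    if args.channel_file and not args.src_dirtree:
        parser.error("--channel-file is only valid with --src-dirtree")
    return args
```

`parser.error()` prints a usage message and exits with status 2, which is the usual argparse behavior for invalid combinations.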

For more details, and complete descriptions of all command line options, execute:

./slack2discord.py --help

File attachments

As noted at the end of Slack: How to read Slack data exports, files uploaded to Slack are not directly part of a Slack export. Instead, the export contains URLs that point to the locations of the files on Slack servers. These URLs include a token that gives you the ability to access those files. These tokens are listed along with the exports on your Slack workspace's export page, which can be found at https://<workspace-name>.slack.com/services/export

When running the script, such files will first be downloaded from Slack to a local dir (which can be controlled with the --downloads-dir option), and then uploaded to Discord. There are a number of reasons why this operation could fail, and the files listed in the Slack export might not be found. These include:

The file was deleted from Slack after the export was performed
The download token associated with that file export was revoked (this can be done via the export page for the Slack workspace)

In these cases (and for any other HTTP errors related to downloading file attachments from Slack), the default behavior is for the script to fail by raising an HTTPError. This allows you to investigate the situation before deciding how to proceed.

For the special case of HTTP Not Found errors, you can override this behavior, and simply log the not found file as a warning, with the command line option --ignore-file-not-found.

Note that if any files are deleted before the export is created, this state will be reflected within the export (the file will have its mode set to tombstone). Any such files will always be logged as warnings and ignored, regardless of whether or not the ignore option is set for the HTTP Not Found case.
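The decision logic described in this section can be summarized in a small sketch. The function names and the 'skip'/'raise' return values are hypothetical. Only the tombstone mode, the Not Found special case, and the default of failing come from the text above.

```python
import logging

logger = logging.getLogger("slack2discord.sketch")


def is_tombstone(file_entry):
    """True if the export says the file was deleted before the export."""
    return file_entry.get("mode") == "tombstone"


def on_download_error(status_code, ignore_file_not_found=False):
    """Return 'skip' or 'raise' for a failed download.

    Only HTTP 404 (Not Found) can be downgraded to a warning; every other
    HTTP error still aborts so that it can be investigated.
    """
    if status_code == 404 and ignore_file_not_found:
        logger.warning("File not found on Slack (HTTP 404), skipping")
        return "skip"
    return "raise"
```

A caller would check `is_tombstone()` before attempting any download, and consult `on_download_error()` only after a request has actually failed.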

Internals

The Discord Python API uses asyncio, so there is the potential to speed up the overall execution time by having multiple Discord HTTP API calls execute in parallel. I have intentionally chosen to not do this.

Within a single channel, we want all of the messages in the channel (and all of the messages within a thread) to be posted in order of timestamp, so that is a reason to serialize those.

A better argument could be made for parallelizing posting to multiple channels. I decided that, at least for the time being, it would be far easier to reason about errors (and potentially restart a failed script, although no such restart support is currently included) if only one channel was imported at a time.
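A minimal sketch of this serialized approach, assuming a hypothetical `post_message` coroutine that stands in for the real Discord call: each message is awaited before the next one is sent, and channels are handled one after another.

```python
import asyncio
from decimal import Decimal


async def post_message(channel, message, sent):
    """Hypothetical stand-in for a Discord API call."""
    await asyncio.sleep(0)  # yield to the event loop, as a real call would
    sent.append((channel, message["ts"]))


async def import_channels(channels):
    """Post everything serially: one channel at a time, oldest message first."""
    sent = []
    for channel, messages in channels.items():
        # Slack timestamps are decimal strings; Decimal avoids float rounding.
        for message in sorted(messages, key=lambda m: Decimal(m["ts"])):
            await post_message(channel, message, sent)  # no gather(), no tasks
    return sent
```

Using `asyncio.gather()` across channels would be the parallel alternative that the paragraph above deliberately avoids.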

Libraries

This code uses the following libraries:

discord.py (docs, pypi, source) (yes, there really is a .py suffix included in the package name)
decorator (docs, pypi, source)
requests (docs, pypi, source)
tqdm (docs, pypi, source)

External docs

Development

If you want to work on development of this library, besides the packages previously installed (see Prereqs above), you should additionally install the required dev packages into your virtualenv:

pip install -r requirements-dev.txt

This will allow you to run automated tests via pytest:

pytest

As well as mypy static typing checks:

mypy slack2discord.py

I'm using the lower case typing notation for collections (which also removes the need for an import), which requires Python 3.9+. In the future I may switch to using the | operator rather than Union, which would require Python 3.10+.

Additionally you can run flake8 style checks:

flake8 slack2discord.py slack2discord

You can automatically run all checks with ./check.sh. GitHub is configured (via .github/workflows/check.yaml and GitHub Actions) to automatically run these checks on any PRs, to require the checks to pass before merging to master, and to automatically run these checks for any pushes to master.

Future work

Some items I am considering:

Better error reporting, so that if an entire import is not successful, it is easier to resume in a way that avoids duplicates.
Add more automated tests
Ways to optimize file downloads:
- Download multiple files asynchronously using aiohttp
- Stream file downloads in chunks via Response.iter_content

Feel free to open issues in GitHub if there are any other features you would like to see.


slack2discord's Issues

Creating channels should be true by default

First of all, thank you very much for this project. I just used it to migrate a Slack workspace yesterday and it would have been a real pain, if not outright impossible, if I didn't have this. While migrating, I ran into a few issues from which I have a few requests. To be able to discuss different issues separately, I've opened a few issues. I can help to create a PR with some of the changes if you agree with them.

The issue

I moved a Slack workspace to a completely fresh Discord server and imagine most people using this program would do the same. The first time I ran the program, I ran into a problem when the script tried to post messages because I didn't supply the --create parameter. I think this parameter should be true by default or at the very least also documented more clearly on this repository's README.

test issue

created a test issue per slack request

Deal with discord file size limit

According to various sources online (many complaining about this), the max file size on a free plan is 8 MB. (I have not yet tried to verify this.) For paid plans, the currently posted limits at https://discord.com/nitro are either 50 MB or 500 MB. Online sources were inconsistent, some saying 50 MB, some saying 100 MB. But the bottom line is that there is no one single fixed limit, and it could be subject to change.

Unlike the case of the message text character limit (#29), I don't think we just want to err on the side of the lower fixed limit for free accounts. In that case, the consequences are minimal, we just might split messages that didn't really need it (or split messages more than necessary). But in this case, there's no good fallback, we just can't upload the file if we're over the limit. So we don't want to have a false positive.

Which means that rather than try to anticipate this in advance, we should instead detect and then handle the error. I will have to do some testing to see precisely how this manifests itself at the HTTP level. (I fear that right now we'll just end up in an endless retry loop, see #25)

If we can't upload a file, we should log an error and then continue. I'll have to think about whether that's all we do. See the long discussion at #20 (comment), where I was debating whether for the case of a file deleted from slack if we should upload some placeholder to discord indicating a file that ought to be there but is missing. There is potentially a stronger case for doing this here, since we know various info about the file (its name, possibly its type, its size). Still, I'm not sure if the minimal added value of an empty placeholder is actually worth the trouble.

Deal with discord channel limits

While testing channel creation for #24 (and somewhat for #23 too), I came across yet another unexpected limit:

2023-02-08 11:51:26 ERROR    asyncio Task exception was never retrieved
future: <Task finished name='Task-5' coro=<MyClient.my_background_task() done, defined at /Users/rich/projects/slack2discord/poc/./background_task_asyncio.py:57> exception=HTTPException('400 Bad Request (error code: 50035): Invalid Form Body\nIn parent_id: Maximum number of channels in category reached (50)')>
Traceback (most recent call last):
  File "/Users/rich/projects/slack2discord/poc/./background_task_asyncio.py", line 69, in my_background_task
    await self.create_channels()
  File "/Users/rich/projects/slack2discord/poc/./background_task_asyncio.py", line 346, in create_channels
    await self.create_text_channel(guild, channel_name)
  File "/Users/rich/projects/slack2discord/poc/./background_task_asyncio.py", line 294, in create_text_channel
    channel = await guild.create_text_channel(channel_name, category=text_channels_category)
  File "/Users/rich/.virtualenvs/slack2discord/lib/python3.10/site-packages/discord/guild.py", line 1331, in create_text_channel
    data = await self._create_channel(
  File "/Users/rich/.virtualenvs/slack2discord/lib/python3.10/site-packages/discord/http.py", line 744, in request
    raise HTTPException(response, data)
discord.errors.HTTPException: 400 Bad Request (error code: 50035): Invalid Form Body
In parent_id: Maximum number of channels in category reached (50)

Googling a bit, I think the following limits exist and are relevant:

Limit of 50 channels per category
Limit of 500 channels per server

There's another limit that's worth noting here, but I don't think it's probably directly relevant:

Limit of 50 categories per server (probably not relevant)

Regarding categories, the current behavior when creating any new channels is to look for a category with the name 'Text Channels'. This is chosen because it is the default when creating a new Discord server; it's where #general is located. But users (or maybe just admins?) can create new categories, and rename existing ones, so there's no guarantee that there will be such a category. If it is not found, we place any newly created channels in no category.

I verified that channels can be created with no category, and they are not subject to the 50 channel per category limit. Presumably the only limit is the 500 channels per server, but I admittedly haven't tested that yet.

In some ways the existing behavior is okay, but we should perhaps consider a few modifications:

Rather than just hardcode 'Text Channels', perhaps have that be the default, but allow overriding with a --category option.
But what if the specified category can't be found? I think if it's a user inputted category, that's probably an error, not that there should be some fallback.
Although another option would be to allow for the automatic creation of the category, just like we allow for the automatic creation of channels.
But if we're going to allow for specification of category, do we also allow for specification of no category?

Sigh, already this is starting to get too complicated...

Let me move on to the more important points I wanted to make:

If we are going to put newly created channels in a category, we should check in advance to make sure that we're not going to hit the 50 channel per category limit.
If we are going to hit the limit, then we should either fail, or create the channels with no category. I'm undecided about which, and/or to what extent this should be controllable.

So we could skip adding category features, keep it mostly the way it is, and just put new channels in 'Text Channels' if it exists and there's room, or in no category if either that category doesn't exist, or if it does exist but doesn't have space for all of the new channels.

Regarding the 500 channel limit, we should also check that in advance, making sure that the number of existing channels plus the number of channels that we are going to create doesn't put us over the limit. If it does, we have little choice but to fail gracefully (ideally before changing any state in Discord), with a sufficiently instructive error message. Options include: the user can consolidate/delete channels from Slack, not import everything, or import multiple Slack channels into a single Discord channel.
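The advance checks proposed above might look roughly like this. The limits are the ones quoted in this issue (not independently verified), and the function name, parameters, and return values are hypothetical.

```python
MAX_CHANNELS_PER_CATEGORY = 50   # as reported in the error above
MAX_CHANNELS_PER_SERVER = 500    # found by searching, not verified


def plan_channel_creation(existing_in_server, existing_in_category, to_create):
    """Decide where new channels go before changing any state in Discord.

    existing_in_category is None when the 'Text Channels' category does not
    exist. Returns 'category' or 'no-category'; raises ValueError if the
    server-wide limit would be exceeded.
    """
    if existing_in_server + to_create > MAX_CHANNELS_PER_SERVER:
        raise ValueError(
            f"Creating {to_create} channels would exceed the "
            f"{MAX_CHANNELS_PER_SERVER} channel per server limit"
        )
    if existing_in_category is None:
        return "no-category"
    if existing_in_category + to_create > MAX_CHANNELS_PER_CATEGORY:
        return "no-category"   # not enough room for all of the new channels
    return "category"
```

This follows the simplest option discussed above: fall back to no category rather than fail when the category is full, and fail only on the server-wide limit.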

Deal with discord character limit on message text

#25 complains about a situation in which a failure that is never going to pass gets stuck in a retry loop forever.

I'll address the retry situation in that issue, but the more pressing concern IMHO is that discord has character limits that I wasn't aware of, and that I think need to be dealt with. (Via either truncation, splitting up, or perhaps some combination of the two.)

In the case reported, the limit was 4000. Most sources I've seen say there's a 2000 character limit per message, but I think you can get higher limits if you pay? But sadly it's not just message contents that might be the problem, there could be issues with embeds as well.
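One way to approach the splitting option is sketched below. It assumes the conservative 2000 character limit mentioned above and prefers to break on whitespace. The function name is hypothetical, and the sketch does not address embeds or how the resulting parts should be threaded.

```python
def split_message(text, limit=2000):
    """Split text into chunks of at most `limit` characters.

    Breaks at the last newline or space that keeps the chunk within the
    limit, and falls back to a hard split when there is no whitespace.
    """
    chunks = []
    while len(text) > limit:
        # Search indices 0..limit so a separator right at the limit counts.
        cut = max(text.rfind("\n", 0, limit + 1), text.rfind(" ", 0, limit + 1))
        if cut <= 0:
            chunks.append(text[:limit])      # hard split, nothing dropped
            text = text[limit:]
        else:
            chunks.append(text[:cut])
            text = text[cut + 1:]            # drop the separator itself
    if text:
        chunks.append(text)
    return chunks
```

Erring on the side of the lower limit only means that some messages are split more often than strictly necessary.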

Some refs:
https://www.itgeared.com/what-is-the-character-limit-on-discord/
https://discord.com/developers/docs/topics/opcodes-and-status-codes
https://www.google.com/search?q=discord+maximum+message+length&oq=discord+maximum+mes&aqs=chrome.0.0i512j69i57j0i22i30j0i390l4.4002j0j7&sourceid=chrome&ie=UTF-8
https://www.integromat.com/en/help/how-to-split-and-post-several-messages-without-exceeding-a-certain-character-limit-e-g-discord
https://discord.com/developers/docs/resources/webhook#execute-webhook-jsonform-params
https://www.reddit.com/r/discordapp/comments/lfawsu/why_is_the_discord_message_character_limit_is_2000/

Attn: @shmulvad

Open to donations?

Hi Rich. Not sure what the best avenue to go about communicating with you is and I know @thomasloupe was instrumental in this project. I just wanted to let you know that I used slack2discord (with some modifications) to upload ~15k messages from a Slack export to Discord. I'm very appreciative of your time working on this project and would love to contribute a small donation to you for your work. It wouldn't be much but I'd like to show some appreciation. Thanks again for putting time into this.

Parsing error of channel names with multiple dashes

One of my Slack channels was named something similar to string1---string2. This got parsed as string1-string2 when creating the channel, but as string1---string2 when trying to post messages to the channel (which then failed because a channel with that name hadn't yet been created). This should be fixed so it is consistent.
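A sketch of one possible fix: apply the same normalization both when creating a channel and when looking it up. It assumes, based only on the behavior reported above, that Discord collapses runs of dashes into a single dash. The function name is hypothetical.

```python
import re


def normalize_channel_name(name):
    """Collapse runs of dashes so creation and lookup agree on the name."""
    return re.sub(r"-{2,}", "-", name)
```

Discord may apply other transformations to channel names that this sketch does not cover.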

Posting messages that raise errors will make the script loop forever

When posting the messages, I would get errors for a few of them such as

Caught HTTP exception sending message to channel: 400 Bad Request (error code: 50035): Invalid Form Body
In content: Must be 4000 or fewer in length.

Of course, trying to repost this will keep on giving the same error as the underlying message didn't change. This means the script will loop forever until manually stopped by the user.

I think you should provide an option like --max-retry-count , so if the same message keeps failing more than, say, 3 times, it will get skipped and perhaps logging all occurrences of this type to a log file. I had to manually change this part of the source code to not have the program get stuck in the same part every time I ran it.
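The suggested behavior could be sketched like this. The `send` argument is a hypothetical coroutine function standing in for the real posting call, and `max_retries` mirrors the proposed (not existing) --max-retry-count option.

```python
import asyncio
import logging

logger = logging.getLogger("slack2discord.sketch")


async def send_with_retries(send, max_retries=3):
    """Await send() up to max_retries times.

    Returns True on success, or False once every attempt has failed, so
    that the caller can log the message as skipped and move on.
    """
    for attempt in range(1, max_retries + 1):
        try:
            await send()
            return True
        except Exception as exc:
            logger.warning("Attempt %d of %d failed: %s", attempt, max_retries, exc)
    return False
```

A fuller version would likely retry only on errors that can plausibly succeed later, since a 400 response for an oversized message will never pass.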

parser is missing messages posted by bots

We are searching for messages (see SlackParser.parse_file() in parser.py) with:

if 'user_profile' in message and 'ts' in message and 'text' in message:

then getting the name based on:

real_name = message['user_profile']['real_name']

The problem is that bots don't have user profiles.

A bot has a bot_id, but that's probably irrelevant.

Note that all messages have the following:

type
user
text
ts

This comes from https://slack.com/help/articles/220556107-How-to-read-Slack-data-exports#how-to-read-messages

So what we probably ought to do at the highest level is look for messages just based on type being set to message.

If there is a user_profile, that's great, we can use it, although in retrospect, it might be better for us to post the messages to Discord using display_name rather than real_name (both of these are within the user_profile).

But back to the bug... Note that we can also get the info for real users from users.json (in the export), but that only includes real users, not bots, so it doesn't really help us.

Normal user ID's seem to be of the form Uxxxxxxxx, where the x's are alphanumeric characters.

The only bot example I have is for user USLACKBOT. My proposal is that if there is no user_profile, then look to user. Strip a leading U off of the user if present, and use the rest of the string. If it happens to not start with a U, then just use the entire string.

And maybe as a final fallback, if there is no user (even though the docs claim there always is), log a warning, and use the string ???
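The proposed fallback chain can be written out as a short sketch. The function name `author_name` is hypothetical. The field names and the ??? placeholder are the ones discussed above.

```python
import logging

logger = logging.getLogger("slack2discord.sketch")


def author_name(message):
    """Pick a display name for a message, including messages from bots."""
    profile = message.get("user_profile")
    if profile and profile.get("real_name"):
        return profile["real_name"]          # normal users
    user = message.get("user")
    if user:
        # Strip one leading U if present, e.g. USLACKBOT -> SLACKBOT.
        return user[1:] if user.startswith("U") and len(user) > 1 else user
    logger.warning("Message has neither user_profile nor user")
    return "???"
```

Selecting messages by `message.get("type") == "message"` rather than by the presence of user_profile would let bot messages reach this function in the first place.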

I have no clue if perhaps the slackbot is special, and maybe other user-defined bots actually are in users.json. I would need more data for that, the slack documentation isn't sufficient.

Deal with discord embed limits

I am limiting #29 (which is the underlying issue that caused the filing of #25) to deal with just the Discord limit on the number of characters within the text of a message, which is 2000 for free users.

But according to https://www.itgeared.com/what-is-the-character-limit-on-discord/ , there are other issues related to embeds, and this issue is for dealing with them. For our purposes, embeds originate as links within Slack messages, which Slack calls "attachments". (What I call actual attachments, Slack calls "files".)

I think we should deal with the following character limits:

256 characters for title
4096 characters for description

Unlike the case of exceeding the limit on actual message text, I think in these cases it's probably fine to truncate and add something like ... to indicate the truncation.
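A minimal sketch of that truncation, using the two limits listed above. The helper name and the constants are hypothetical. The marker counts toward the limit so that the result never exceeds it.

```python
EMBED_TITLE_LIMIT = 256          # limits as listed above
EMBED_DESCRIPTION_LIMIT = 4096


def truncate(text, limit, marker="..."):
    """Shorten text to at most `limit` characters, ending with a marker."""
    if len(text) <= limit:
        return text
    return text[: limit - len(marker)] + marker
```

For example, `truncate(title, EMBED_TITLE_LIMIT)` leaves short titles untouched and cuts long ones to exactly 256 characters.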

There are some other character limits described, but I'm not sure if all of them are applicable for us, and/or worth worrying about. But I'm not entirely positive. Like do the URLs count as part of the 6000 character total limit?

There is one other limit that we are already accounting for, which is a max of 10 embeds per message. Currently we're dealing with this by just truncating the list. Which is a little lame, although since the links still actually appear within the text (just not the previews), maybe it's okay? I will consider dealing with this as part of this issue, or maybe not and/or break out as another issue. One possible way is to create one or more blank messages just for the purpose of the extra embeds, although that has the same caveats in terms of dealing with threading as expressed at the end of #29 (comment)

Attn: @shmulvad

Deal with discord active thread limit

While researching #32, I came across yet another unexpected limit.

https://support.discord.com/hc/en-us/community/posts/360056762431-Increase-channel-limit?page=2#community_comment_4416922099351 claims:

1: Yes, there is a limit of 1000 ACTIVE threads, but you can have an unlimited amount of archived threads.

I have not seen any documentation of this (although plenty of other limits are also undocumented), nor have I tried to replicate it.

Perhaps we should explicitly archive threads when creating them during the import? Or maybe only if they are sufficiently old? And/or maybe only if we are in danger of hitting the limit? (See Guild.active_threads(), I think we'd need to fetch all of the active threads and count how many there are.)

Note that I'm not entirely sure how to archive a thread via the API. I don't see any kind of archive() method on the Thread class. There is an archived attribute, it's not clear whether that's read-only, or if I can archive a thread just by setting that.

Another related attribute is auto_archive_duration, although again it's not clear if that can be changed by just setting that. Note that the Message.create_thread() method used to create the thread (called on the message that's the root of the thread) does have a parameter auto_archive_duration. (If not provided, a channel default value is used.) But this is measured in minutes. If we're close to the limit, even setting it to 1 minute might not be sufficient unless we substantially artificially slow down the importing of data to Discord. I'd have to test if 0 is a legal value and if that would force the thread to be immediately archived on creation. But even if that worked, would it persist? That is, after creating the thread, would each new message added to the thread in the import cause it to be unarchived? If so, would that be okay if a setting of 0 then caused it to be immediately re-archived?

The bottom line is that some further investigation and testing is required.

Allow reusing already downloaded images

The bot spends a very long time downloading images because it is done sequentially and on every run. As you note, this could be sped up significantly by downloading them in parallel.

Another thing that also could be done though is to reuse downloaded images if the program has already been run. I tried to manually set the download directory, but all it accomplished was that it still downloaded images, just overwriting the previously downloaded images with exact copies. I think if the image is already there, it should be reused - especially if the user explicitly provides the download directory flag.
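The reuse idea might be sketched as follows. The `fetch` callable is a hypothetical downloader that returns the file's bytes, and the check for a non-empty file is an assumption about what counts as already downloaded.

```python
from pathlib import Path


def download_if_missing(url, dest, fetch):
    """Download url to dest unless a non-empty file is already there.

    Returns True if a download happened, False if the local copy was reused.
    """
    dest = Path(dest)
    if dest.is_file() and dest.stat().st_size > 0:
        return False                         # reuse the existing download
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(fetch(url))
    return True
```

This does not detect partial or stale files. Comparing against the size recorded in the Slack export would be a stricter check.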

Finally, adding something like tqdm during the image downloading would make the script more user friendly.