
mincka / dmarchiver


A tool to archive the direct messages, images and videos from your private conversations on Twitter

License: GNU General Public License v3.0

Python 100.00%
twitter conversation direct-message archive tweets downloader backup dm

dmarchiver's People

Contributors

cajuncooks, dependabot-preview[bot], dependabot[bot], mincka, trwnh

dmarchiver's Issues

Error when using the latest release - not all messages downloaded

Hi there.
I downloaded the latest Mac release. When I ran the archiver, I noticed it only went back to July 2017 in some of the threads and not all the way to the beginning. One of my largest DM threads is normally a txt file of about 22MB once downloaded, and this time it was only 1.5MB. I took a screenshot of what seems to be different from what I normally see. Hope this helps with the explanation.
As a reminder I'm using the Mac version.

Thanks
Ronnie
screenshot 2017-10-06 19 25 34
screenshot 2017-10-06 19 56 01
screenshot 2017-10-06 19 55 37

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Edited by Mincka on August 10th 2017:

In this case, the error message was caused by invalid JSON data. It seemed to be related to a connection issue and could not be reproduced. Other causes can be found here: https://stackoverflow.com/a/18460958

Original post:
New ticket created from
#1 (comment)

$ /Users/xxx/Downloads/dmarchiver -id "YYY" -di -dg

Enter your username or email: zzz

Enter your password (characters will not be displayed):

Authentication succeedeed.

Conversation ID specified (YYY). Retrieving only one thread.

Starting crawl of 'YYY'

Failed to execute script cmdline

Traceback (most recent call last):

File "dmarchiver/cmdline.py", line 70, in <module>

File "dmarchiver/cmdline.py", line 62, in main

File "dmarchiver/core.py", line 468, in crawl

File "requests/models.py", line 826, in json

File "json/__init__.py", line 319, in loads

File "json/decoder.py", line 339, in decode

File "json/decoder.py", line 357, in raw_decode

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
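
The traceback ends in json.loads being handed a body that is not JSON (an empty string produces exactly this message). A minimal sketch of a guard, assuming a hypothetical helper named parse_json_body that is not part of DMArchiver; it only illustrates catching the decode error and reporting what was received instead of crashing:

```python
import json


def parse_json_body(body):
    """Return the decoded JSON, or raise ValueError with a readable hint.

    Hypothetical helper, not DMArchiver's code.
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError as error:
        # Show the start of the body so an HTML error page or an empty
        # response is easy to recognise in the message.
        preview = body[:80]
        raise ValueError(
            "Response was not valid JSON ({}); body starts with {!r}".format(
                error.msg, preview
            )
        ) from error


if __name__ == "__main__":
    print(parse_json_body('{"items": []}'))
    try:
        parse_json_body("")
    except ValueError as error:
        print(error)
```

A caller could retry the request a few times on this error before giving up, since the original report points to a transient connection problem.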

Text added to cards may be incomplete

When a link is shared and the user adds additional text, the added text may not be included in the log.

In the following generated sample, "This is a test." is not included.

  <p class="TweetTextSize  js-tweet-text tweet-text" lang="" data-aria-label-part="0">How I lost my 25-year battle against corporate claptrap <a href="https://t.co/gIrbtXuRSv" rel="nofollow noopener" dir="ltr" data-expanded-url="https://www.ft.com/lucycolumn" class="twitter-timeline-link" target="_blank" title="https://www.ft.com/lucycolumn" >
        <span class="tco-ellipsis"/>
        <span class="invisible">https://www.</span>
        <span class="js-display-url">ft.com/lucycolumn</span>
        <span class="invisible"/>
        <span class="tco-ellipsis">
            <span class="invisible">&nbsp;</span>
        </span>
    </a> This is a test.</p>

This is because cssselect extracts only the text node before the <a> element. A workaround could be to use text_content():

def _parse_dm_text(self, element):
    # text_content() returns all text inside the <p>, including text that follows child elements
    text_tweet = element.cssselect("p.tweet-text")[0]
    dm_text = text_tweet.text_content()
    return DirectMessageText(dm_text)

The output would be:
[2017-08-16 13:37:49] <Julien Ehrhart> [Card-summary_large_image] https://www.ft.com/lucycolumn How I lost my 25-year battle against corporate claptrap https://www.ft.com/lucycolumn This is a test.

Two issues here:

  1. The link appears twice (once during the parsing of the card, once during the parsing of the text) -> Acceptable
  2. The emojis are not in the text so they are stripped from the output -> Not acceptable
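
The difference between the leading text node and the full text can be reproduced without Twitter. This is a rough sketch that uses the standard library's ElementTree as a stand-in for lxml (an assumption made so the example runs anywhere; the element's text attribute behaves the same way in both). The sample markup is a shortened version of the one above, and the function names are made up for the illustration:

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the generated sample.
SAMPLE = (
    '<p class="tweet-text">How I lost my battle '
    '<a href="https://t.co/gIrbtXuRSv">'
    '<span class="js-display-url">ft.com/lucycolumn</span>'
    "</a> This is a test.</p>"
)


def leading_text(element):
    """Only the text node that precedes the first child element."""
    return element.text or ""


def full_text(element):
    """All text inside the element, including text after child elements."""
    return "".join(element.itertext())


if __name__ == "__main__":
    paragraph = ET.fromstring(SAMPLE)
    print(repr(leading_text(paragraph)))
    print(repr(full_text(paragraph)))
```

The text after the closing </a> is stored as the tail of the link element, which is why reading only the paragraph's own text misses it.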

Handle errors in requests (locked account)

Connections from new IP addresses, with a new browser (Firefox user-agent) or after invalid authentications may trigger an account block. The script will then be unable to parse the response and will fail with error messages such as "KeyError: 'threads'" or "KeyError: 'inner'".

image
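
One way to turn these KeyErrors into a readable message is to check for the key before using it. A minimal sketch, assuming the parsed response is a plain dict; the helper and exception names are hypothetical, and only the key names 'threads' and 'inner' come from the errors quoted above:

```python
class UnexpectedResponseError(Exception):
    """Raised when a response lacks a key the parser relies on."""


def require_key(payload, key):
    """Return payload[key], or raise a descriptive error if it is missing."""
    if not isinstance(payload, dict) or key not in payload:
        raise UnexpectedResponseError(
            "Response has no {!r} key; the account may be locked or "
            "awaiting verification.".format(key)
        )
    return payload[key]


if __name__ == "__main__":
    print(require_key({"threads": ["123-456"]}, "threads"))
    try:
        require_key({"error": "locked"}, "threads")
    except UnexpectedResponseError as error:
        print(error)
```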

Conversation not extracted

Hello,
I tried to extract a twitter DM conversation with the macOS version, but I encountered two issues:

  1. The screen says "Process completed" and I can find the .txt files in Finder, but the conversation I need has not been extracted (I have maybe two dozen conversations, but I only need that one).
  2. The extracted conversations look like this: no DM visible. Did I do something wrong?
    screen shot 2017-06-25 at 15 35 46

My boyfriend died in April. I'm trying to save our twitter DMs... Thanks for any help.
Celine

Download videos

Neither the argument -di nor -dg currently downloads videos. Would be cool to add a -dv argument.

KeyError: Threads, Failed to execute script cmdline

New Error attempting on Windows,
With or without -di -dg

DMArchiver 0.1.6
Running on Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)]

Authentication succeedeed.

Conversation ID not specified. Retrieving all the threads.
Traceback (most recent call last):
File "dmarchiver\cmdline.py", line 95, in <module>
File "dmarchiver\cmdline.py", line 86, in main
File "dmarchiver\core.py", line 302, in get_threads
KeyError: 'threads'
Failed to execute script cmdline

DMArchiver 0.1.7
Running on Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)]

Authentication succeedeed.

Press Ctrl+C at anytime to write the current conversation and skip to the next one.
Keep it pressed to exit the script.

Conversation ID not specified. Retrieving all the threads.
Traceback (most recent call last):
File "dmarchiver\cmdline.py", line 107, in <module>
File "dmarchiver\cmdline.py", line 96, in main
File "dmarchiver\core.py", line 302, in get_threads
KeyError: 'threads'
Failed to execute script cmdline

Only 50 latest conversations are downloaded in "All conversations" mode

From #7:

I think I may have an idea why not all the threads were downloaded the first time. I've counted all the threads in your previous message and found exactly 50 conversations. Currently, to find "all" the conversation IDs, the script loads the conversations available on the "first" "Messages" page but does not simulate scrolling to load more. I thought that all the conversations were listed directly.

My guess is that when you scroll down through all the conversations, at the bottom, Twitter loads the next 50 conversations. I did not identify this case because I have a lot less than 50 conversations on Twitter! But it's an interesting case and I'm going to open a new ticket to improve the "all threads" mode which is in fact a "latest 50 conversations" mode it seems.
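
If the guess about pages of 50 is right, listing every conversation comes down to following a cursor until no further page is reported. A rough sketch under that assumption; fetch_page is a hypothetical callable standing in for the real request, and the (ids, next_cursor) shape is invented for the example, not Twitter's actual response format:

```python
def collect_conversation_ids(fetch_page):
    """Follow page cursors until the listing reports no further page.

    fetch_page(cursor) is a hypothetical callable returning a tuple
    (ids, next_cursor); next_cursor is None on the last page.
    """
    ids = []
    cursor = None
    while True:
        page_ids, cursor = fetch_page(cursor)
        ids.extend(page_ids)
        # Stop on the last page, or on an empty page to avoid looping forever.
        if cursor is None or not page_ids:
            return ids


if __name__ == "__main__":
    # Fake two-page listing used only to exercise the loop.
    fake_pages = {None: (["c1", "c2"], "page2"), "page2": (["c3"], None)}
    print(collect_conversation_ids(lambda cursor: fake_pages[cursor]))
```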

Request: csv output

Hi, I just used your DMArchiver and now have 143 txt files with no indication of order or tweep. It would be of great help if you could send the output to one file in CSV format: date (ANSI), tweep name, text, with each conversation separated by an empty line or something like "======".
For your information:
I got mails from Twitter that someone logged into my account. Which is fine of course.
It didn't work well in W10 outside a command prompt. But that could also be because it was run from a network share.
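
Until such an option exists, the existing txt output could be converted after the fact. A minimal sketch, assuming every message starts with "[YYYY-MM-DD hh:mm:ss] <name> " as in the log sample quoted in the cards issue above; the function name and column names are made up, and lines that do not start with a timestamp are treated as continuations of the previous message:

```python
import csv
import io
import re

# "[2017-08-16 13:37:49] <Julien Ehrhart> message text"
LINE_PATTERN = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] <([^>]*)> ?(.*)$"
)


def log_to_csv(log_text):
    """Convert one conversation log to CSV text with a header row."""
    rows = []
    for line in log_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            rows.append(list(match.groups()))
        elif rows:
            # Continuation of a multi-line message.
            rows[-1][2] += "\n" + line
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(["date", "author", "text"])
    writer.writerows(rows)
    return output.getvalue()


if __name__ == "__main__":
    print(log_to_csv("[2017-08-16 13:37:49] <Julien Ehrhart> Hello, world"))
```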

Add a new option to add the date to the image filenames

When I want to search for a date in the archive, I can use any text tool to find [YYYY-MM-DD hh:mm:ss]. However, the image filenames use a less intuitive format, converting https://ton.twitter.com/i/ton/data/dm/firsthash/secondhash/thirdhash.jpg to firsthash-secondhash-thirdhash.jpg.

Can you add a new option to use an intuitive or "human" format? Something like this: YYYYMMDD-hhmmss-thirdhash.jpg. The thirdhash would avoid any collision in the filename. The original format is probably useful for someone, which is why I'm asking for a new option instead of changing the default.
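
The requested naming scheme is simple to express. A minimal sketch, assuming the message time is available as a datetime when the image is saved; the function name is hypothetical and this is not how DMArchiver currently names files:

```python
import posixpath
from datetime import datetime
from urllib.parse import urlparse


def dated_image_name(image_url, sent_at):
    """Build 'YYYYMMDD-hhmmss-thirdhash.jpg' from the image URL and DM time."""
    # The last path segment of the URL is "thirdhash.jpg".
    last_segment = posixpath.basename(urlparse(image_url).path)
    return "{}-{}".format(sent_at.strftime("%Y%m%d-%H%M%S"), last_segment)


if __name__ == "__main__":
    url = "https://ton.twitter.com/i/ton/data/dm/firsthash/secondhash/thirdhash.jpg"
    print(dated_image_name(url, datetime(2017, 8, 16, 13, 37, 49)))
```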

Option file

Any chance an option file could be integrated? (Windows)

Possible options:
Username and password for scheduled archives
Browser emulation selection

Code 131: Internal error

I am trying this tool out but I get the same error each time:

Previous conversation not found. Creating a new one with incremental support.
An error occured during the parsing of the tweets.

Twitter error details below:
Code 131: Internal error

Certificate Verify Fail when using -u -p switch

Hi,

I tried using dmarchiver -di with the username and password prompted, and everything worked fine. However, when I use the -u and -p arguments to pass the username and password on the command line, I get the following error:

C:\dmarchiver>dmarchiver.exe -u [email protected] -p yyyyyyyy -di
DMArchiver 0.1.7
Running on Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)]

Traceback (most recent call last):
File "site-packages\requests\packages\urllib3\connectionpool.py", line 595, in urlopen
File "site-packages\requests\packages\urllib3\connectionpool.py", line 352, in _make_request
File "site-packages\requests\packages\urllib3\connectionpool.py", line 831, in _validate_conn
File "site-packages\requests\packages\urllib3\connection.py", line 289, in connect
File "site-packages\requests\packages\urllib3\util\ssl_.py", line 308, in ssl_wrap_socket
File "ssl.py", line 362, in wrap_socket
File "ssl.py", line 580, in __init__
File "ssl.py", line 807, in do_handshake
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "site-packages\requests\adapters.py", line 423, in send
File "site-packages\requests\packages\urllib3\connectionpool.py", line 621, in urlopen
requests.packages.urllib3.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "dmarchiver\cmdline.py", line 107, in <module>
File "dmarchiver\cmdline.py", line 75, in main
File "dmarchiver\core.py", line 270, in authenticate
File "site-packages\requests\sessions.py", line 488, in get
File "site-packages\requests\sessions.py", line 475, in request
File "site-packages\requests\sessions.py", line 596, in send
File "site-packages\requests\adapters.py", line 497, in send
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)
Failed to execute script cmdline

Getting DMArchiver to work with phone verification and login codes

Whenever I want to use DMA I have to disable phone verification. It's a minimal risk as I can turn it back on again right after. But I imagine that if you used it more often than me, it would become a hassle. And even worse, some people might forget to turn it back on, or leave it off on purpose because of that.

Now I tried using the login code sent to my phone as a password once, and it obviously didn't work. The 1 hour temporary app password doesn't work either. Do you think you can add support for proper app authentication or look into why the temporary password doesn't work? And then add a command line switch -pv (phone verification) or something?

Crashes after login

After login, DMArchiver crashed several times. Using my @_name eventually worked but then it began crawling from the very beginning.
This is a very long DM conversation (2 years (!) now), so Twitter did not like that and locked my account.

I only needed an incremental update to get the last two weeks of this conversation. Hope this will be possible soon. (Windows 7)

Not Finding Files Of DM's

So I've been having trouble displaying any of the DMs it has downloaded. I have a Mac if that helps. Can anyone help?

KeyError: 'trusted'

Conversation ID not specified. Retrieving all the threads.
Traceback (most recent call last):
File "dmarchiver\cmdline.py", line 107, in <module>
File "dmarchiver\cmdline.py", line 96, in main
File "dmarchiver\core.py", line 302, in get_threads
KeyError: 'trusted'
Failed to execute script cmdline

Where do you find the output files on macOS?

First of all, I want to thank you so much for creating this tool. No one on the internet did this but you. It's my 1 year anniversary with my girlfriend and I really want to retrieve a string of messages because I'm planning to gift her something, and it is important that I have all the DMs archived. Unfortunately I'm very bad at coding and I couldn't understand the instructions. I think my convo ID is 853339611514507267. Do I have to type my username and password and then enter this: $ dmarchiver -id "853339611514507267", or is there something I am missing? Also, where do the messages end up? I do realize how stupid this whole question might be, but I'd be grateful if you could assist me. Thank you so much.

Allow differential backup to complete a previous backup

From @williammmiller1's idea.

It's currently not possible to make a differential archive based on a previous extraction or a time-based option. Consequently, the script has to download the complete thread again, back to the first message.

Multiple implementations could be done:

  1. Allow the possibility to specify a previous backup to complete only the delta since the last message
  2. Allow the possibility to specify min / max tweet IDs with the arguments
  3. Allow the possibility to specify min / max date with the arguments
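
The first option could be sketched as follows: read the newest timestamp from a previous backup, then keep only the messages that are more recent. This assumes the "[YYYY-MM-DD hh:mm:ss]" prefix used in the logs and a hypothetical list of (datetime, text) tuples for the freshly crawled messages; neither function exists in DMArchiver:

```python
import re
from datetime import datetime

TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")


def last_archived_time(log_text):
    """Return the newest timestamp found in a previous backup, or None."""
    latest = None
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            if latest is None or stamp > latest:
                latest = stamp
    return latest


def newer_than(messages, cutoff):
    """Keep (sent_at, text) tuples strictly newer than the cutoff."""
    if cutoff is None:
        return list(messages)
    return [message for message in messages if message[0] > cutoff]


if __name__ == "__main__":
    previous = "[2017-08-16 13:37:49] <A> hi\n[2017-08-17 09:00:00] <B> hello"
    cutoff = last_archived_time(previous)
    crawled = [(datetime(2017, 8, 17, 9, 0, 0), "hello"),
               (datetime(2017, 8, 18, 8, 30, 0), "new message")]
    print(newer_than(crawled, cutoff))
```

Timestamps only have one-second resolution, so two messages sent in the same second could be confused; filtering on tweet IDs (the second option) would avoid that.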

Intermittent timeout errors

This won't work. Once I enter my username and password, the application crashes and doesn't provide any further information.

DMArchiver stopped working: "Unknown element type"

For the past few days, downloading the direct messages hasn't worked anymore. Something in the HTML must have changed on Twitter's side.

I get the error Unknown element type. In the txt file the date, time and username is there but the actual text message is missing.

Scalability (# of messages)?

Are there limits on the number of messages? I successfully tested the script with roughly 13k messages / 1.3 MB in one conversation.

The script seems to cache the messages. Would it maybe be more scalable if it stored the messages in a file incrementally instead of caching them?
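
Writing incrementally could look roughly like this: each batch of formatted lines is appended to the file as soon as it is available, so memory use stays bounded by the batch size. The function names and the idea of "batches" are assumptions for the illustration, not DMArchiver's internals:

```python
import os
import tempfile


def append_messages(path, lines):
    """Append one batch of already formatted log lines to the archive file."""
    with open(path, "a", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


def archive_in_batches(path, batches):
    """Write each batch as soon as it is available; return the line count."""
    written = 0
    for batch in batches:
        append_messages(path, batch)
        written += len(batch)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, "conversation.txt")
        print(archive_in_batches(target, [["first", "second"], ["third"]]))
```

One caveat: if the crawl runs from the newest message back to the oldest, appended batches would end up in reverse order and the file would need reordering afterwards.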

How can I set proxy settings?

Hello
When running, it gives me this error.
I think I need to set a proxy server for the HTTPS connection (in Iran, HTTPS is blocked).
Is it possible to define a new command line parameter to set the proxy settings?
Thanks

E:\tww>dmarchiver.exe
DMArchiver 0.1.7
Running on Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)]

Enter your username or email: zoghal
Enter your password (characters will not be displayed):
Traceback (most recent call last):
  File "site-packages\requests\packages\urllib3\connection.py", line 142, in _new_conn
  File "site-packages\requests\packages\urllib3\util\connection.py", line 98, in create_connection
  File "site-packages\requests\packages\urllib3\util\connection.py", line 88, in create_connection
OSError: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "site-packages\requests\packages\urllib3\connectionpool.py", line 595, in urlopen
  File "site-packages\requests\packages\urllib3\connectionpool.py", line 352, in _make_request
  File "site-packages\requests\packages\urllib3\connectionpool.py", line 831, in _validate_conn
  File "site-packages\requests\packages\urllib3\connection.py", line 254, in connect
  File "site-packages\requests\packages\urllib3\connection.py", line 151, in _new_conn
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x02DB23D0>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "site-packages\requests\adapters.py", line 423, in send
  File "site-packages\requests\packages\urllib3\connectionpool.py", line 640, in urlopen
  File "site-packages\requests\packages\urllib3\util\retry.py", line 287, in increment
requests.packages.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='twitter.com', port=443): Max retries exceeded with url: /login (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x02DB23D0>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dmarchiver\cmdline.py", line 107, in <module>
  File "dmarchiver\cmdline.py", line 75, in main
  File "dmarchiver\core.py", line 270, in authenticate
  File "site-packages\requests\sessions.py", line 488, in get
  File "site-packages\requests\sessions.py", line 475, in request
  File "site-packages\requests\sessions.py", line 596, in send
  File "site-packages\requests\adapters.py", line 487, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='twitter.com', port=443): Max retries exceeded with url: /login (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x02DB23D0>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions',))
Failed to execute script cmdline
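
The traceback shows the login request going through Requests, which accepts a proxies mapping on each request and on a session. A minimal sketch of how a command line option could feed that mapping; the --proxy flag and both function names are hypothetical, since the tool has no such option according to this thread:

```python
import argparse


def build_proxies(proxy_url):
    """Return a mapping in the shape Requests accepts as `proxies`, or None."""
    if not proxy_url:
        return None
    return {"http": proxy_url, "https": proxy_url}


def parse_args(argv):
    parser = argparse.ArgumentParser(prog="dmarchiver-sketch")
    # "--proxy" is a hypothetical flag; the issue only asks for such an option.
    parser.add_argument("--proxy", default=None,
                        help="proxy URL, e.g. http://127.0.0.1:8080")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--proxy", "http://127.0.0.1:8080"])
    print(build_proxies(args.proxy))
```

The resulting dict would be passed as the proxies argument of the session's get and post calls. Without any code change, Requests normally also honours the HTTPS_PROXY environment variable, which may be worth trying first.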

Twitter account being locked due to suspected Robot

Hi, twice now when using DMArchiver I've had my Twitter account locked because they think I'm using a robot (which I am, technically) to do something bad (which I'm not).

Both times it's easy to unlock using ReCaptcha or SMS code - I'm assuming it's because of the speed of the requests being made.

Would it be possible to add an optional argument to introduce a delay between requests, to more closely resemble normal browser action?

eg. -td 10 (10 second delay between requests)

Thanks for an excellent program.
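
A delay option like the suggested -td could be backed by a small throttle that is called before every request. This is only a sketch: the class name is made up, and the clock and sleep functions are injectable so the behaviour can be checked without actually waiting:

```python
import time


class Throttle:
    """Ensure at least delay_seconds pass between consecutive wait() calls."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.delay_seconds - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


if __name__ == "__main__":
    throttle = Throttle(0.2)
    for number in range(3):
        throttle.wait()  # call this right before each request
        print("request", number)
```

Only the time still missing is slept, so slow requests are not delayed further.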

Request: Save twitter user names?

Would it be possible to have the program save the @ twitter id of the people I have conversations with, perhaps at the top of the log? It would help me jump to those conversations with searches and id those that deactivate.

XMLSyntaxError: switching encoding: encoder error

Edited by Mincka on August 10th 2017:
For anybody Googling for this error message XMLSyntaxError: switching encoding: encoder error:

  • It may be related to the parsing in lxml of emojis or specific ranges of Unicode characters (like 𝜋) which are four-byte characters
  • The issue is specific to macOS and Python 3.5
  • A ticket for a bug is opened but nobody seems to be working on it (https://bugs.launchpad.net/lxml/+bug/1538213)

Possible workarounds:

  1. Strip the emojis on macOS before the parsing, see this implementation in 073a358
  2. Downgrade to Python 3.4 if you can. I attempted to upgrade to Python 3.6 but had other compatibility issues, this time with pyinstaller, so I was unable to move forward. Downgrading to Python 3.4 allowed my tool to work perfectly on all platforms.
  3. Remove lxml package and reinstall it using STATIC_DEPS=true (lorien/grab#199 (comment)). However, I cannot guarantee this will work. Using multiple Python versions on macOS is such a huge pain. 😞
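
For the first workaround, one possible way to strip the problematic characters is to remove every code point outside the Basic Multilingual Plane, which are exactly the characters that need four bytes in UTF-8. This is a sketch of the idea only and not necessarily what commit 073a358 does; the function name is made up:

```python
import re

# Code points U+10000 and above need four bytes in UTF-8.
ASTRAL = re.compile("[\U00010000-\U0010FFFF]")


def strip_astral(text):
    """Remove characters outside the Basic Multilingual Plane."""
    return ASTRAL.sub("", text)


if __name__ == "__main__":
    sample = "pi \U0001D70B and \U0001F600!"
    print(strip_astral(sample))
```

Some emojis live inside the Basic Multilingual Plane (U+263A, for instance) and would be left untouched by this filter, and every stripped character is lost from the archive.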

Original message:
My setup:

  • Python 3.5.2
  • macOS Sierra 10.12
$ dmarchiver
Enter your username or email: myusername
Enter your password (characters will not be displayed): 
Authentication succeedeed.
Conversation ID not specified. Retrieving all the threads.
Starting crawl of '################'
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/bin/dmarchiver", line 9, in <module>
    load_entry_point('dmarchiver==0.0.5', 'console_scripts', 'dmarchiver')()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 67, in main
    crawler.crawl(thread_id, args.download_images, args.download_gifs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 443, in crawl
    tweets, download_images, download_gif)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 357, in _process_tweets
    document = lxml.html.fragment_fromstring(value)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
    base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
  File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
  File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
  File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
  File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
  File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
  File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1

How do I find the requests in the Twitter DM developer page

Hi there! Thanks so much for replying to my message. I found my token ID using the developer tab in Chrome, but how do I find the requests to get the DM message ID once I've clicked on the conversation I'm looking at? Do I do it the same way, on the same screen where I found the token ID? Under Elements?
Here is a screenshot of my window. I can switch to Safari if it is easier to find there.
screenshot 2016-10-23 15 04 33

Sorry to be such a novice, I really wish I understood all of this. So please excuse my ignorance, but I have a willingness and ability to take direction when it comes to tech stuff.

Thanks,
Ronnie
Here is the developer tool in Safari, but I'm not sure where to look for the conversation ID for the conversation shown on screen as #TeamErin

screenshot 2016-10-23 15 27 20
