mincka / dmarchiver
A tool to archive the direct messages, images and videos from your private conversations on Twitter
License: GNU General Public License v3.0
Version 0.2.0 does not run on my 32-bit Windows OS
Hi there.
I downloaded the latest Mac release. When I ran the archiver, I noticed it only went back to July 2017 in some of the threads and not all the way to the beginning. One of my largest DM threads is normally a text file of about 22 MB once downloaded, and this time it was only 1.5 MB. I took one screenshot of what seems to be different from what I normally see. I hope this helps explain the problem.
As a reminder I'm using the Mac version.
Edited by Mincka on August 10th 2017:
In this case, the error message was caused by invalid JSON data. It seemed to be related to a connection issue and could not be reproduced. Other causes can be found here: https://stackoverflow.com/a/18460958
Original post:
New ticket created from #1 (comment)
$ /Users/xxx/Downloads/dmarchiver -id "YYY" -di -dg
Enter your username or email: zzz
Enter your password (characters will not be displayed):
Authentication succeedeed.
Conversation ID specified (YYY). Retrieving only one thread.
Starting crawl of 'YYY'
Failed to execute script cmdline
Traceback (most recent call last):
File "dmarchiver/cmdline.py", line 70, in <module>
File "dmarchiver/cmdline.py", line 62, in main
File "dmarchiver/core.py", line 468, in crawl
File "requests/models.py", line 826, in json
File "json/__init__.py", line 319, in loads
File "json/decoder.py", line 339, in decode
File "json/decoder.py", line 357, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
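A minimal sketch of how this kind of crash could be avoided, assuming the failing call is a JSON decode of a response body that arrived empty or truncated. The helper name parse_json_body is hypothetical and not part of DMArchiver; it only uses the standard json module.

```python
import json


def parse_json_body(body):
    """Return the decoded JSON value, or None when the body is not valid JSON.

    Hypothetical helper for illustration; not part of DMArchiver.
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # An empty body fails with "Expecting value: line 1 column 1 (char 0)".
        print("Response was not valid JSON: {0}".format(exc))
        return None
```

A caller could then retry the request or skip the page when None comes back, instead of letting the traceback end the script.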
When a link is shared and user adds additional text, the added text may not be included in the log.
In the following generated sample, "This is a test." is not included.
<p class="TweetTextSize js-tweet-text tweet-text" lang="" data-aria-label-part="0">How I lost my 25-year battle against corporate claptrap <a href="https://t.co/gIrbtXuRSv" rel="nofollow noopener" dir="ltr" data-expanded-url="https://www.ft.com/lucycolumn" class="twitter-timeline-link" target="_blank" title="https://www.ft.com/lucycolumn" >
<span class="tco-ellipsis"/>
<span class="invisible">https://www.</span>
<span class="js-display-url">ft.com/lucycolumn</span>
<span class="invisible"/>
<span class="tco-ellipsis">
<span class="invisible"> </span>
</span>
</a> This is a test.</p>
This is because the parser reads only the text node that comes before the <a> element. A workaround could be to use text_content():
def _parse_dm_text(self, element):
    dm_text = ''
    text_tweet = element.cssselect("p.tweet-text")[0]
    # text_content() also returns the text inside and after child elements
    dm_text = text_tweet.text_content()
    return DirectMessageText(dm_text)
The output would be:
[2017-08-16 13:37:49] <Julien Ehrhart> [Card-summary_large_image] https://www.ft.com/lucycolumn How I lost my 25-year battle against corporate claptrap https://www.ft.com/lucycolumn This is a test.
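The difference between the two ways of reading the text can be shown with a small sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, so that it runs without extra packages: joining itertext() here plays the role that text_content() plays in lxml. The sample markup and both function names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup: text, then a link, then trailing text.
SAMPLE = (
    '<p class="tweet-text">Headline '
    '<a href="https://example.com/">link</a> This is a test.</p>'
)


def first_text_node(markup):
    """Return only the text that precedes the first child element."""
    return ET.fromstring(markup).text


def full_text(markup):
    """Return all the text, including child elements and trailing text."""
    return "".join(ET.fromstring(markup).itertext())
```

With this sample, first_text_node stops at "Headline ", while full_text also returns the link label and the sentence typed after the link.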
Two issues here:
And add a message when the tool starts with this info
My Twitter account gets locked each time I use dmarchiver
Is it possible to save the content of tweets linked in the DM, such as the pictures and the video in the tweet?
When I open the exe it asks for my username and password. Then it archives the dms, and closes itself right after finishing.
Hello,
I tried to extract a twitter DM conversation with the macOS version, but I encountered two issues:
My boyfriend died in April. I'm trying to save our twitter DMs... Thanks for any help.
Celine
Neither the argument -di nor -dg currently downloads videos. Would be cool to add a -dv argument.
Split crawler and parser to be able to parse a dump of a conversation.
New error when running on Windows, with or without -di -dg:
DMArchiver 0.1.6
Running on Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)]
Authentication succeedeed.
Conversation ID not specified. Retrieving all the threads.
Traceback (most recent call last):
File "dmarchiver\cmdline.py", line 95, in <module>
File "dmarchiver\cmdline.py", line 86, in main
File "dmarchiver\core.py", line 302, in get_threads
KeyError: 'threads'
Failed to execute script cmdline
DMArchiver 0.1.7
Running on Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)]
Authentication succeedeed.
Press Ctrl+C at anytime to write the current conversation and skip to the next one.
Keep it pressed to exit the script.
Conversation ID not specified. Retrieving all the threads.
Traceback (most recent call last):
File "dmarchiver\cmdline.py", line 107, in <module>
File "dmarchiver\cmdline.py", line 96, in main
File "dmarchiver\core.py", line 302, in get_threads
KeyError: 'threads'
Failed to execute script cmdline
From #7:
I think I may have an idea why not all the threads were downloaded the first time. I've counted all the threads in your previous message and found exactly 50 conversations. Currently, to find "all" the conversation IDs, the script loads the conversations available on the "first" "Messages" page but does not simulate scrolling to load more. I thought that all the conversations were listed directly.
My guess is that when you scroll down through all the conversations, at the bottom, Twitter loads the next 50 conversations. I did not identify this case because I have far fewer than 50 conversations on Twitter! But it's an interesting case, and I'm going to open a new ticket to improve the "all threads" mode, which in fact seems to be a "latest 50 conversations" mode.
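A rough sketch of what loading every page could look like. Everything about the page format is an assumption: fetch_page is a hypothetical callable, and the "threads" and "next_cursor" keys are placeholders, since the real shape of Twitter's response for older conversations is not described here.

```python
def collect_thread_ids(fetch_page):
    """Collect conversation IDs across all pages.

    fetch_page(cursor) is a hypothetical callable that returns a dict with
    a "threads" list and an optional "next_cursor" value (both assumed).
    """
    thread_ids = []
    cursor = None  # None requests the first page
    while True:
        page = fetch_page(cursor)
        thread_ids.extend(page["threads"])
        cursor = page.get("next_cursor")
        if not cursor:
            # No further page: every conversation has been listed.
            return thread_ids
```

The loop stops only when the server reports no further page, so an account with more than 50 conversations would no longer be cut off at the first batch.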
Hi, I just used your DMArchiver and now have 143 txt files with no indication of order or tweep. It would be of great help if you could send the output to one file in CSV format with the columns date (ANSI), tweep name and text, with each conversation separated by an empty line or something like "======".
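Until such an option exists, the text logs could be converted afterwards. The sketch below assumes each message line follows the "[date] <name> text" layout seen in the sample output elsewhere on this page; the function name log_to_csv and the column names are made up, and lines that do not match (for example continuation lines of multi-line messages) are skipped.

```python
import csv
import io
import re

# Assumed log line layout: [YYYY-MM-DD hh:mm:ss] <Display Name> message text
LINE_RE = re.compile(r"^\[(?P<date>[^\]]+)\] <(?P<author>[^>]+)> (?P<text>.*)$")


def log_to_csv(lines, out):
    """Write one CSV row per recognised log line to the file-like object out."""
    writer = csv.writer(out)
    writer.writerow(["date", "author", "text"])
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            writer.writerow(
                [match.group("date"), match.group("author"), match.group("text")]
            )


# Usage: convert an in-memory sample and keep the CSV text in a buffer.
buffer = io.StringIO()
log_to_csv(["[2017-08-16 13:37:49] <Julien Ehrhart> Hello, world"], buffer)
```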
For your information:
I got mails from Twitter that someone logged into my account. Which is fine of course.
It didn't work well on Windows 10 outside a command prompt, but that could also be because it was run from a network share.
When I want to search for a date in the archive, I can use any text tool to find [YYYY-MM-DD hh:mm:ss]. However, the image filenames use a less intuitive format, converting https://ton.twitter.com/i/ton/data/dm/firsthash/secondhash/thirdhash.jpg to firsthash-secondhash-thirdhash.jpg.
Can you add a new option to use an intuitive or "human" format? Something like this: YYYYMMDD-hhmmss-thirdhash.jpg. The thirdhash would avoid any collision in the filename. The original format is probably useful for someone, which is why I'm asking for a new option instead of a change to the default.
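A small sketch of the requested naming scheme, assuming the message timestamp is available as a datetime when the image is saved. The function name human_image_name is hypothetical and not an existing DMArchiver option.

```python
import posixpath
from datetime import datetime
from urllib.parse import urlparse


def human_image_name(url, sent_at):
    """Build YYYYMMDD-hhmmss-thirdhash.jpg from an image URL and a timestamp."""
    # The last path segment of the URL is the "thirdhash.jpg" part.
    third_hash = posixpath.basename(urlparse(url).path)
    return "{0}-{1}".format(sent_at.strftime("%Y%m%d-%H%M%S"), third_hash)


example = human_image_name(
    "https://ton.twitter.com/i/ton/data/dm/firsthash/secondhash/thirdhash.jpg",
    datetime(2017, 8, 16, 13, 37, 49),
)
# example == "20170816-133749-thirdhash.jpg"
```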
Any chance an option file could be integrated? (Windows)
Possible options:
Username and password for scheduled archives
Browser emulation selection
I am trying this tool out but I get the same error each time:
Previous conversation not found. Creating a new one with incremental support.
An error occured during the parsing of the tweets.
Twitter error details below:
Code 131: Internal error
Hi,
I tried using dmarchiver -di with the username and password prompts and everything worked fine. However, when I use the -u and -p arguments to pass the username and password on the command line, I get the following error:
C:\dmarchiver>dmarchiver.exe -u [email protected] -p yyyyyyyy -di
DMArchiver 0.1.7
Running on Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)]
Traceback (most recent call last):
File "site-packages\requests\packages\urllib3\connectionpool.py", line 595, in urlopen
File "site-packages\requests\packages\urllib3\connectionpool.py", line 352, in _make_request
File "site-packages\requests\packages\urllib3\connectionpool.py", line 831, in _validate_conn
File "site-packages\requests\packages\urllib3\connection.py", line 289, in connect
File "site-packages\requests\packages\urllib3\util\ssl_.py", line 308, in ssl_wrap_socket
File "ssl.py", line 362, in wrap_socket
File "ssl.py", line 580, in __init__
File "ssl.py", line 807, in do_handshake
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "site-packages\requests\adapters.py", line 423, in send
File "site-packages\requests\packages\urllib3\connectionpool.py", line 621, in urlopen
requests.packages.urllib3.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "dmarchiver\cmdline.py", line 107, in <module>
File "dmarchiver\cmdline.py", line 75, in main
File "dmarchiver\core.py", line 270, in authenticate
File "site-packages\requests\sessions.py", line 488, in get
File "site-packages\requests\sessions.py", line 475, in request
File "site-packages\requests\sessions.py", line 596, in send
File "site-packages\requests\adapters.py", line 497, in send
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:600)
Failed to execute script cmdline
Whenever I want to use DMA I have to disable phone verification. It's a minimal risk, as I can turn it back on again right after. But I imagine that if you use it more often than I do, it becomes a hassle. Even worse, some people might forget to turn it back on, or leave it off on purpose because of that.
I tried using the login code sent to my phone as a password once, and it obviously didn't work. The 1-hour temporary app password doesn't work either. Do you think you can add support for proper app authentication, or look into why the temporary password doesn't work? And then add a command line switch such as -pv (phone verification)?
After login, DMArchiver crashed several times. Using my @_name eventually worked but then it began crawling from the very beginning.
This is a very long DM conversation (2 years (!) now), so Twitter did not like that and locked my account.
I only needed an incremental update to get the last two weeks of this conversation. Hope this will be possible soon. (Windows 7)
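One way such an incremental update could work is to stop crawling as soon as a message from the previous archive is reached. The sketch below assumes messages arrive newest first as (message_id, text) pairs and that the ID of the last archived message is known; both the function name and the data shape are hypothetical.

```python
def new_messages(messages_newest_first, last_known_id):
    """Return the messages newer than last_known_id, oldest first.

    messages_newest_first is a hypothetical iterable of (message_id, text)
    pairs, ordered from the most recent message to the oldest.
    """
    fresh = []
    for message_id, text in messages_newest_first:
        if message_id == last_known_id:
            # Everything from here on is already in the previous archive.
            break
        fresh.append((message_id, text))
    fresh.reverse()  # restore chronological order for appending to the log
    return fresh
```

Because the loop breaks at the first known message, a two-year conversation would only cost the requests needed for the last two weeks.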
So I've been having trouble displaying any of the DMs it has downloaded. I have a Mac, if that helps. Can anyone help?
This works like a charm, but I'm seeing characters like Полегче.. which means that it doesn't support other languages... Is there any other way to do it?
Conversation ID not specified. Retrieving all the threads.
Traceback (most recent call last):
File "dmarchiver\cmdline.py", line 107, in <module>
File "dmarchiver\cmdline.py", line 96, in main
File "dmarchiver\core.py", line 302, in get_threads
KeyError: 'trusted'
Failed to execute script cmdline
First of all, I want to thank you so much for creating this tool. No one on the internet did this but you. It's my one-year anniversary with my girlfriend, and I really want to retrieve a string of messages because I'm planning to gift her something, so it is important that I have all the DMs archived. Unfortunately, I'm very bad at coding and I couldn't understand the instructions. I think my conversation ID is 853339611514507267. Do I have to type my username and password and then enter $ dmarchiver -id "853339611514507267", or is there something I am missing? Also, where do the messages end up? I do realize how stupid this whole question might be, but I'd be grateful if you could assist me. Thank you so much.
From @williammmiller1's idea.
It's currently not possible to make a differential archive based on a previous extraction or a time-based option. Consequently, the script will have to download again a complete thread up to the first message.
Multiple implementations could be done:
When I try to use this tool with my long password, it gives me the error "> was unexpected at this time.", but when I changed my password to a much shorter one it worked flawlessly.
This won't work. Once I enter my username and password, the application crashes and doesn't provide any further information.
The version in pypi.python.org is 0.1.3.
For the past few days, downloading the direct messages hasn't worked anymore. Something in the HTML must have changed on Twitter's side.
I get the error "Unknown element type". In the txt file the date, time and username are there, but the actual text message is missing.
This type of message is currently not supported and an exception will be triggered for the message.
Conversation ID not specified. Retrieving all the threads.
Expecting value: line 1 column 1 (char 0)
Are there limits on the number of messages? I successfully tested the script with roughly 13k messages / 1.3 MB in one conversation.
The script seems to cache the messages. Would it maybe be more scalable if it stored the messages in a file incrementally instead of caching them?
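A sketch of the incremental approach suggested here, assuming messages are already formatted as text lines when each batch is received. The function name append_messages is hypothetical; it does not describe how DMArchiver currently writes its output.

```python
def append_messages(path, messages):
    """Append a batch of formatted message lines to the archive file.

    Opening in append mode means each batch is flushed to disk when the
    file is closed, so memory use does not grow with the conversation.
    """
    with open(path, "a", encoding="utf-8") as handle:
        for message in messages:
            handle.write(message + "\n")
```

One trade-off to keep in mind: if messages are crawled from newest to oldest, appending batch by batch produces a file in reverse order, so the order would need to be fixed in a final pass.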
Hm, I can't get -r to work. I use it with -u and -id.
For a group chat:
['conversation']['title']['raw']
Single user:
'{0} (@{1})'.format(json['conversation']['title']['raw'], json['conversation']['title']['screen_name'])
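The two cases above could be wrapped in one helper. This is only a sketch: the is_group parameter is a hypothetical flag, because how the response marks a group conversation is not shown here, and the payload shape is taken from the two expressions above.

```python
def conversation_title(payload, is_group):
    """Return a display title for a conversation.

    payload is the decoded JSON response; is_group is a hypothetical flag
    supplied by the caller.
    """
    title = payload["conversation"]["title"]
    if is_group:
        # Group chats only carry a raw title.
        return title["raw"]
    # One-to-one chats also expose the other user's screen name.
    return "{0} (@{1})".format(title["raw"], title["screen_name"])
```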
Hello,
When running the tool, I get the error below.
I think I need to set a proxy server for the HTTPS connection (in Iran, HTTPS is blocked).
Is it possible to define a new parameter in the command line to set proxy settings?
Thanks
E:\tww>dmarchiver.exe
DMArchiver 0.1.7
Running on Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)]
Enter your username or email: zoghal
Enter your password (characters will not be displayed):
Traceback (most recent call last):
File "site-packages\requests\packages\urllib3\connection.py", line 142, in _new_conn
File "site-packages\requests\packages\urllib3\util\connection.py", line 98, in create_connection
File "site-packages\requests\packages\urllib3\util\connection.py", line 88, in create_connection
OSError: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "site-packages\requests\packages\urllib3\connectionpool.py", line 595, in urlopen
File "site-packages\requests\packages\urllib3\connectionpool.py", line 352, in _make_request
File "site-packages\requests\packages\urllib3\connectionpool.py", line 831, in _validate_conn
File "site-packages\requests\packages\urllib3\connection.py", line 254, in connect
File "site-packages\requests\packages\urllib3\connection.py", line 151, in _new_conn
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x02DB23D0>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "site-packages\requests\adapters.py", line 423, in send
File "site-packages\requests\packages\urllib3\connectionpool.py", line 640, in urlopen
File "site-packages\requests\packages\urllib3\util\retry.py", line 287, in increment
requests.packages.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='twitter.com', port=443): Max retries exceeded with url: /login (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x02DB23D0>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "dmarchiver\cmdline.py", line 107, in <module>
File "dmarchiver\cmdline.py", line 75, in main
File "dmarchiver\core.py", line 270, in authenticate
File "site-packages\requests\sessions.py", line 488, in get
File "site-packages\requests\sessions.py", line 475, in request
File "site-packages\requests\sessions.py", line 596, in send
File "site-packages\requests\adapters.py", line 487, in send
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='twitter.com', port=443): Max retries exceeded with url: /login (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x02DB23D0>: Failed to establish a new connection: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions',))
Failed to execute script cmdline
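For the proxy request above, a command line option could be parsed roughly as follows. The --proxy flag and the build_proxies name are hypothetical and do not exist in DMArchiver. The returned mapping follows the scheme-to-URL shape that the requests library accepts for its proxies argument; the sketch itself only needs argparse.

```python
import argparse


def build_proxies(argv):
    """Parse a hypothetical --proxy option and return a proxies mapping."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--proxy", default=None,
                        help="proxy URL, e.g. http://127.0.0.1:8080")
    args = parser.parse_args(argv)
    if args.proxy is None:
        return None  # no proxy requested: connect directly
    # Route both plain and TLS traffic through the same proxy.
    return {"http": args.proxy, "https": args.proxy}
```

The result could then be passed along with each request, for example as the proxies argument of a requests session call.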
Hi, twice now when using DMArchiver I've had my Twitter account locked because they think I'm using a robot (which I am, technically) to do something bad (which I'm not).
Both times it's easy to unlock using ReCaptcha or SMS code - I'm assuming it's because of the speed of the requests being made.
Would it be possible to add an optional argument to introduce a delay between requests, to more closely resemble normal browser action?
eg. -td 10 (10 second delay between requests)
Thanks for an excellent program.
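The delay suggested above could be implemented with a small throttle that is called before every request. This is a sketch under assumptions: the -td option does not exist yet, the Throttle class is hypothetical, and the clock and sleep parameters are only injectable so the behaviour can be checked without really waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the delay not yet elapsed.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() just before each HTTP request would space the requests at least delay_seconds apart, whatever the time spent parsing in between.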
Stuff like [Tweet], [Card-summary], [Card-player] and so on is still missing.
:(
The render will allow the display of pictures, GIFs, emojis (system specific or Twitter variations).
Would it be possible to have the program save the @ twitter id of the people I have conversations with, perhaps at the top of the log? It would help me jump to those conversations with searches and id those that deactivate.
I get this every time I try to archive, whether it's just one conversation or all of them. Looked through all the known open/closed issues, didn't see anything like this.
Twitter error code reference
Edited by Mincka on August 10th 2017:
For anybody Googling for this error message (XMLSyntaxError: switching encoding: encoder error):
Possible workarounds:
STATIC_DEPS=true (lorien/grab#199 (comment)). However, I cannot guarantee this will work. Using multiple Python versions on macOS is such a huge pain. 😞
Original message:
My setup:
$ dmarchiver
Enter your username or email: myusername
Enter your password (characters will not be displayed):
Authentication succeedeed.
Conversation ID not specified. Retrieving all the threads.
Starting crawl of '################'
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.5/bin/dmarchiver", line 9, in <module>
load_entry_point('dmarchiver==0.0.5', 'console_scripts', 'dmarchiver')()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/cmdline.py", line 67, in main
crawler.crawl(thread_id, args.download_images, args.download_gifs)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 443, in crawl
tweets, download_images, download_gif)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/dmarchiver/core.py", line 357, in _process_tweets
document = lxml.html.fragment_fromstring(value)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 825, in fragment_fromstring
base_url=base_url, **kw)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 786, in fragments_fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/lxml/html/__init__.py", line 752, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77737)
File "src/lxml/parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116674)
File "src/lxml/parser.pxi", line 1711, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115220)
File "src/lxml/parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109345)
File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
File "src/lxml/parser.pxi", line 624, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104147)
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1
I want to know how the status of a sent message ("sent" vs "seen") can be captured by the script.
Thanks!
Keep context cookies locally to improve the authenticity of the subsequent logins.
https://stackoverflow.com/a/37118451/3049282
I am not able to download images.
Hi there! Thanks so much for replying to my message. I found my token ID using the developer tab in Chrome, but how do I find the requests to get the DM message ID once I've clicked on the conversation I'm looking at? Do I do it the same way, on the same screen where I found the token ID? Under Elements?
Here is a screenshot of my window. I can switch to Safari if that makes it easier to find.
Sorry to be such a novice; I really wish I understood all of this. Please excuse my ignorance, but I am willing and able to take direction when it comes to tech stuff.
Thanks,
Ronnie
[email protected]
Here is the developer tool in Safari but i'm not sure where to look for conversation iD for the conversation shown on screen as #TeamErin
Ex. 😂 vs [Face with tears of joy] in the output
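A possible sketch of the bracketed form, using the Unicode character names from the standard unicodedata module. The function name describe_emoji is hypothetical, and the test for "is this an emoji" is a simple heuristic (Unicode category "So", Symbol other), so it also rewrites other symbols such as © and does not handle emoji built from several code points.

```python
import unicodedata


def describe_emoji(text):
    """Replace symbol characters with their Unicode name in brackets."""
    parts = []
    for char in text:
        if unicodedata.category(char) == "So":
            name = unicodedata.name(char, None)
            # Keep the character as-is when it has no Unicode name.
            parts.append("[{0}]".format(name.capitalize()) if name else char)
        else:
            parts.append(char)
    return "".join(parts)
```

For example, describe_emoji("lol 😂") gives "lol [Face with tears of joy]".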