eksopl / asagi
Asagi imageboard dumper
License: Other
Implement a save-state feature: when asagi is closed, it should save its state so that the next run can continue from that save state instead of parsing everything all over again.
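A minimal sketch of what such a save state could look like, assuming the state fits in simple key/value pairs. The class name, the file format and the example key are hypothetical and not taken from asagi's source.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Properties;

// Hypothetical helper, not part of asagi: persists dumper state as key/value pairs.
final class SaveState {
    // Writes the state to disk, e.g. from a shutdown hook.
    static void save(File file, Properties state) throws IOException {
        try (OutputStream out = new FileOutputStream(file)) {
            state.store(out, "asagi save state (hypothetical format)");
        }
    }

    // Reads the state back; a missing file yields an empty state.
    static Properties load(File file) throws IOException {
        Properties state = new Properties();
        if (file.exists()) {
            try (InputStream in = new FileInputStream(file)) {
                state.load(in);
            }
        }
        return state;
    }
}
```

On shutdown the dumper could store something like a per-board "last seen post" value (for example under a made-up key such as a.lastPost) and read it back on the next start.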
Detect and mark posts that are deleted from inside threads.
There should be a log containing all error/exceptions that occur.
It seems that moot has added flags on a few boards. It might be nice to have this archived for completeness. Since we have the poster_ip field, it might be easier to populate that field with a dummy IP that is located within the same country. The other option is adding another field, but doing that on large boards isn't fun.
Basically, it seems that there may be an issue with purging old threads from the queue. It happens when 4chan goes offline for a few hours and returns. It would resume fetching as usual, but the thread count increases and idles at a new number instead. Since the counter usually idles at 160 threads for most boards, I'm not exactly sure if the old threads are still in the queue or not.
[sp 283 0 0 0 0] 22613441: got HTTP status 502
[a 272 14 0 0 0] 67856285: got HTTP status 502
[v 345 0 0 0 0] 145197791: got HTTP status 502
The dumper died while 4chan was down with the following:
Exception in thread: "Page scanner 1 - mlp" java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:485)
at net.easymodo.asagi.Yotsuba.newYotsubaPost(Yotsuba.java:208)
at net.easymodo.asagi.Yotsuba.parsePost(Yotsuba.java:401)
at net.easymodo.asagi.Yotsuba.getPage(Yotsuba.java:442)
at net.easymodo.asagi.Board.content(Board.java:14)
at net.easymodo.asagi.Dumper$PageScanner.run(Dumper.java:284)
at java.lang.Thread.run(Thread.java:722)
Exception in thread: "Topic fetcher #2 - tg" java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern
at java.net.URLDecoder.decode(URLDecoder.java:187)
at net.easymodo.asagi.WWW.doCleanLink(WWW.java:150)
at net.easymodo.asagi.YotsubaJSON.cleanLink(YotsubaJSON.java:230)
at net.easymodo.asagi.YotsubaJSON.makePostFromJson(YotsubaJSON.java:195)
at net.easymodo.asagi.YotsubaJSON.getThread(YotsubaJSON.java:145)
at net.easymodo.asagi.Board.content(Board.java:18)
at net.easymodo.asagi.Dumper$TopicFetcher.run(Dumper.java:442)
at java.lang.Thread.run(Thread.java:722)
Terminating dumper due to unexpected exception.
Please report this issue if you believe it is a bug.
I wasn't able to track down which post contained the trailing % in the email field. However, I suggest one of the following fixes.
if(link.endsWith("%")) link = link + "25";
(forces URLDecoder to make the last character %)
if(link.endsWith("%")) link = link.substring(0, link.length() -1);
(just trims the trailing %)
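Both fixes only cover a link that ends in a bare %. URLDecoder also rejects a % that is followed by fewer than two hex digits, such as a trailing %2. A possible broader variant, as a sketch only: the class and method names are hypothetical, and this is not how WWW.doCleanLink is written.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of asagi.
final class LenientUrlDecoder {
    // Escapes every '%' that does not start a valid %XX sequence, then decodes.
    static String decodeLenient(String link) {
        String repaired = link.replaceAll("%(?![0-9A-Fa-f]{2})", "%25");
        try {
            return URLDecoder.decode(repaired, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

With this, a stray % is kept as a literal character instead of aborting the thread fetch. As with plain URLDecoder, a + in the input is still decoded to a space.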
Exception in thread: "Threadlist fetcher - q" com.google.gson.JsonSyntaxException: java.lang.NumberFormatException: Expected an int but was 4294967295 at line 1 column 1976
at com.google.gson.internal.bind.TypeAdapters$7.read(TypeAdapters.java:232)
at com.google.gson.internal.bind.TypeAdapters$7.read(TypeAdapters.java:222)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:93)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:172)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:40)
at com.google.gson.internal.bind.ArrayTypeAdapter.read(ArrayTypeAdapter.java:72)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:93)
at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:172)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:40)
at com.google.gson.internal.bind.ArrayTypeAdapter.read(ArrayTypeAdapter.java:72)
at com.google.gson.Gson.fromJson(Gson.java:795)
at com.google.gson.Gson.fromJson(Gson.java:761)
at com.google.gson.Gson.fromJson(Gson.java:710)
at com.google.gson.Gson.fromJson(Gson.java:682)
at net.easymodo.asagi.YotsubaJSON.getAllThreads(YotsubaJSON.java:174)
at net.easymodo.asagi.DumperJSON$BoardPoller.run(DumperJSON.java:46)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NumberFormatException: Expected an int but was 4294967295 at line 1 column 1976
at com.google.gson.stream.JsonReader.nextInt(JsonReader.java:602)
at com.google.gson.internal.bind.TypeAdapters$7.read(TypeAdapters.java:230)
... 16 more
Read board/dump settings from external file.
It seems that fatal exceptions aren't logged into the debug file when asagi crashes. This makes it hard to find and report issues when asagi is restarted and floods the console.
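One way this could be approached is a default uncaught-exception handler that appends the stack trace to the debug file before the dumper goes down. This is a sketch under assumptions: the class name and the log path parameter are hypothetical, and asagi's existing debug logging may be structured differently.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical helper, not part of asagi.
final class FatalErrorLogger {
    // Registers a handler for exceptions that no thread catches.
    static void install(final String logPath) {
        Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            @Override
            public void uncaughtException(Thread t, Throwable e) {
                log(logPath, t.getName(), e);
                e.printStackTrace(); // keep the console output as well
            }
        });
    }

    // Appends the thread name and full stack trace to the log file.
    static synchronized void log(String logPath, String threadName, Throwable e) {
        try (PrintWriter out = new PrintWriter(new FileWriter(logPath, true))) {
            out.println("Exception in thread: \"" + threadName + "\" " + e);
            e.printStackTrace(out);
        } catch (IOException io) {
            // Logging itself failed; fall back to the console.
            io.printStackTrace();
        }
    }
}
```

Calling FatalErrorLogger.install("debug.log") once at startup would be enough for the trace to survive a restart. The file name here is only an example.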
Every so often, within 10 minutes or so, I get:
Exception in thread: "Topic fetcher #0 - b" java.lang.NullPointerException
at net.easymodo.asagi.YotsubaJSON.getThread(YotsubaJSON.java:144)
at net.easymodo.asagi.AbstractDumper$TopicFetcher.run(AbstractDumper.java:288)
at java.lang.Thread.run(Thread.java:722)
Terminating dumper due to unexpected exception.
Please report this issue if you believe it is a bug.
Like the error reports, I have no dump to post, no possible evidence to support this. Is there something I can do?
I'm running JDK 1.7, if that's any help.
Create DB tables if those don't exist.
To be able to archive location restricted imageboards.
Move the regexes hardcoded in the source code to a separate regex file.
Perhaps with a script that generates said file from fuuka. Either that, or make fuuka able to read the regex definitions from the same file.
[sjis] tags were just added to /jp/; I assume
https://github.com/eksopl/asagi/blob/master/src/main/java/net/easymodo/asagi/YotsubaAbstract.java
should be updated to handle this.
Exception in thread: "Topic fetcher #0 - vg" java.lang.NumberFormatException: For input string: "106a1746"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at net.easymodo.asagi.Yotsuba.parsePost(Yotsuba.java:326)
at net.easymodo.asagi.Yotsuba.getThread(Yotsuba.java:470)
at net.easymodo.asagi.Board.content(Board.java:18)
at net.easymodo.asagi.Dumper$TopicFetcher.run(Dumper.java:435)
at java.lang.Thread.run(Thread.java:722)
Build 19
When asagi fails to parse a post or an entire thread, asagi should save the raw html/json so that you could either fix the parser or fix the html/json manually, and reparse these when restarting asagi.
Same thing should happen when the DB is dead or busy.
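A rough sketch of the "save the raw html/json" half of this, assuming failed payloads are kept in one directory per board. The class name, the directory layout and the parameters are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of asagi.
final class FailedPayloadStore {
    // Writes the raw response to <failedDir>/<board>/<threadNum>.<extension>
    // so it can be inspected, fixed by hand and reparsed later.
    static Path save(Path failedDir, String board, long threadNum,
                     String extension, String raw) throws IOException {
        Path boardDir = failedDir.resolve(board);
        Files.createDirectories(boardDir);
        Path target = boardDir.resolve(threadNum + "." + extension);
        Files.write(target, raw.getBytes(StandardCharsets.UTF_8));
        return target;
    }
}
```

The fetcher would call this from the catch block around parsing, and a startup pass could walk the same directory to retry the saved files.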
It should be pretty easy to do this on this end. The heavy lifting involved is pretty much just writing the triggers and procedures for updating _threads, _images and _daily.
Support in fuuka should also be easy enough, given that part of the work is already done.
Also, Connector/J is GPL, so it forces asagi to be GPLv2 encumbered. Other dependencies are Apache and BSD, with one LGPL. But since I am considering asagi to be a fuuka derivative, due to it being a completely non-cleanroom reimplementation of fuuka's dumper side, it's already dual GPLv1/Artistic License, so not much of a loss there.
I need to input the JSON without creating the file, so that there is no risk of exposing the asagi.json containing the database info on a badly set up server.
Some kind of support for the HTML Kusaba clones output.
Ugh.
Why won't people just let some things die.
Document the everliving shit out of Asagi's source. It should be good enough to generate Javadoc out of it.
Rather than letting the dumper continue in some morbid state missing threads, we should quit altogether once a thread blows up, so the user can acknowledge the dumper died for some reason, usually the fact that they launched asagi with insufficient heap memory (-Xmx flag).
This subject was discussed earlier as a joke, but it seems like it might be something reasonable.
For Developers:
It would allow us to test the fetcher against HTML code for debugging purposes. We would usually test the fetcher by running it against an entire board to see if it works, but there are always some special cases that we need to test for. This feature would allow us to test against saved or modified HTML code to ensure that we catch all of the bugs/issues/problems. Otherwise, we never know whether the thread we want to test will 404.
For Maintainers:
It is a bit minor, but it would be nice to be able to import original HTML code with the fetcher itself. The community would often be able to provide maintainers with old threads or missing threads saved manually. Therefore, it would be nice if we had an import feature to add the missing information easily.
I would leave the exact implementation up to you, but my suggestion would be some type of "watch" directory within the folder containing asagi. It would be monitored and would parse the entire "watch" directory for threads and import them accordingly.
/home/asagi
`-- watch
    `-- 4chan
        |-- a
        |   |-- 1000000.html
        |   `-- 1000001.html
        `-- jp
            |-- 1000001.html
            `-- 1000002.html
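A minimal sketch of how the layout above could be scanned, assuming the directory levels mean watch/<site>/<board>/<thread number>.html. The class and field names are hypothetical. The sketch only locates the files and does not parse or import them.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of asagi.
final class WatchDirScanner {
    static final class SavedThread {
        final String site;
        final String board;
        final long num;
        final Path file;

        SavedThread(String site, String board, long num, Path file) {
            this.site = site;
            this.board = board;
            this.num = num;
            this.file = file;
        }
    }

    // Collects every <site>/<board>/<num>.html file below the watch directory.
    static List<SavedThread> scan(Path watchDir) throws IOException {
        List<SavedThread> found = new ArrayList<SavedThread>();
        for (Path site : subdirectories(watchDir)) {
            for (Path board : subdirectories(site)) {
                try (DirectoryStream<Path> files = Files.newDirectoryStream(board, "*.html")) {
                    for (Path file : files) {
                        if (!Files.isRegularFile(file)) continue;
                        String name = file.getFileName().toString();
                        String stem = name.substring(0, name.length() - ".html".length());
                        try {
                            found.add(new SavedThread(site.getFileName().toString(),
                                    board.getFileName().toString(), Long.parseLong(stem), file));
                        } catch (NumberFormatException e) {
                            // File name is not a thread number; skip it.
                        }
                    }
                }
            }
        }
        return found;
    }

    private static List<Path> subdirectories(Path dir) throws IOException {
        List<Path> dirs = new ArrayList<Path>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) dirs.add(entry);
            }
        }
        return dirs;
    }
}
```

Each result could then be handed to the existing HTML parser. A one-off scan at startup is the simplest variant; continuous monitoring would need a polling loop or a file watcher on top of this.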
Mark and purge deleted threads.
This is related to Issue #22.
Since one of asagi's goals is to be extended to other imageboards, there should be a setting to specify the time zone of the server being archived. This would ensure that the stored timestamp is accurate.
Also, are we still going to have timestamps all stored in UTC eventually?
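For illustration, converting a server-local time to a UTC timestamp with a configurable zone might look like the sketch below. It assumes Java 8 or later for java.time. The class name, the input format and the idea of passing the zone as a setting are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of asagi.
final class BoardTime {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Interprets localTime in the board's configured zone and returns
    // seconds since the Unix epoch, which is always UTC-based.
    static long toUtcEpochSeconds(String localTime, String zoneId) {
        LocalDateTime local = LocalDateTime.parse(localTime, FORMAT);
        return local.atZone(ZoneId.of(zoneId)).toEpochSecond();
    }
}
```

For example, toUtcEpochSeconds("2013-01-01 00:00:00", "America/New_York") gives 1357016400, which is five hours after the same wall-clock time read as UTC (1356998400).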
Altering images arguably goes against the ideology of an archive, but it could save serious amounts of storage: almost any .png can be reduced by about 50% on average, with a near-imperceptible loss of quality.
What if asagi fetched the image, and if it was a png, it compressed it, and then inserted the new hash into the DB for FF? It looks like something that could be done.
http://pngquant.org/ for more info.
I was wondering could it be possible to add the to archived threads?
Finish the regexes that clean the text before inserting.
Batch adding through JDBC / Connector/J needs some extra checks to be on the safe side.
Chmod and chown support for dumped files.
Support I-M-S.
I think it'd be a great feature if there was a search by ID function.
Also great work on the imageboard, it looks very nice!
Need to look into this. Post deletion detection is a bit too shaky for my tastes.
Asagi is inserting the empty string for a few fields in the DB; make it match Fuuka's behavior.
I've seen the issue crop up on /q/ as well with people using things other than asagi, but I think 4chan now requires accept headers, too?
[e 164 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[h 160 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[s 165 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[e 164 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[h 160 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[s 165 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[e 164 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[h 160 0 0 0 0] threads.json: HTTP error: Forbidden (403)
[s 165 0 0 0 0] threads.json: HTTP error: Forbidden (403)
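If missing request headers do turn out to be the cause, setting them explicitly is straightforward with the JDK's own HTTP client. This is only a sketch: whether the Accept header is what triggers the 403 is unconfirmed, and the class name, header values and the example URL are placeholders.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of asagi.
final class JsonRequest {
    // Builds (but does not yet send) a request with explicit headers.
    static HttpURLConnection prepare(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept", "application/json, */*;q=0.8");
        conn.setRequestProperty("User-Agent", "asagi-example/0.0"); // placeholder value
        return conn;
    }
}
```

The request is only sent once the caller reads from it, for example via getResponseCode() or getInputStream(), so the headers can be inspected or adjusted before that.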