icij / extract
A cross-platform command line tool for parallelised content extraction and analysis.
License: MIT License
The documentation says:
extract spew -d /path/to/files -r redis -o file --file-output-directory /path/to/text
but:
export JAVA_OPTS='-Xms1024m -Xmx10240m'
extract spew -d /path/to/files -r redis -o file --file-output-directory /path/to/text
gives:
myname@extract:~/extract$ ./target/extract-capsule.x spew -d /tmp/input -r redis -o file --file-output-directory /tmp/output
Jul 29, 2016 11:52:45 AM org.icij.extract.cli.Main main
SEVERE: Failed to parse command line arguments: Unrecognized option: -d
I guess I have to look at the working directory and pattern options instead.
I tried to build but got stuck in a test - not that severe, I think - but I can't find any built application, so it is 'that' severe.
log (mvn install -e -X > log 2>&1) attached.
This will also facilitate setting the content length on the embed metadata, as doing this without the saving feature will require every embed to be spooled to disk by TikaInputStream.
The authors have not documented any steps or specs for a successful build, and there is no guidance in this area.
Ubuntu server https://www.ubuntu.com/download/server - Ubuntu 18.10
download ISO - http://releases.ubuntu.com/18.10/ubuntu-18.10-live-server-amd64.iso
VMware Workstation 14 Pro
install Ubuntu / login as user / check current dir
environment details and pre-installation commands
ls
sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-installer
javac -version
sudo apt install oracle-java8-set-default
mvn
apt-cache search maven
sudo apt-get install maven
mvn
mvn -version
sudo apt update
sudo apt install tesseract-ocr
git
ls
git clone https://github.com/ICIJ/extract
ls
cd extract/
ls
NOTE: open the pom.xml in the extract folder in a text editor and modify as shown below
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-gpg-plugin</artifactId>
<version>1.5</version>
<executions>
<execution>
<id>sign-artifacts</id>
<phase>verify</phase>
<goals>
<goal>sign</goal>
</goals>
</execution>
</executions>
<configuration>
<skip>true</skip>
</configuration>
</plugin>
NOTE 2: go to the dir /home/userx/extract/extract-cli/, open the pom.xml file and add this dependency:
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.25</version>
</dependency>
download the slf4j-simple-1.7.25.jar and put it in the /home/user/.m2/repository/org/slf4j/slf4j-api/1.7.25/ folder
mvn install -DskipTests -Dgpg.skip
OR
mvn package -DskipTests -Dgpg.skip
echo 'export JAVA_OPTS="-Xms512m -Xmx1024m"' >> ~/.bashrc
source ~/.bashrc
cd /home/userx/extract/extract-cli/
sudo apt-get install libxtst6:i386
sudo apt-get update
sudo apt-get install libxtst6
sudo updatedb
locate libXtst
sudo apt install libxext6
sudo apt-get install libxrender1 libxtst6 libxi6
java -jar extract-cli.jar
result
usage: extract [command] [options]
usage: extract help
usage: extract version
A cross-platform tool for distributed content-extraction by the data team
at the International Consortium of Investigative Journalists.
Commands
load-report
rollback
wipe-report
spew-dump
clean-report
view-report
inspect-dump
commit
load-queue
rehash
wipe-queue
delete
version
help
dump-queue
spew
copy
tag
queue
dump-report
Additional Image Formats
jpg
bmp
gif
wbmp
png
jpeg
jbig2
Extract will use up to 1 GB of memory on this machine.
Please report issues at: https://github.com/ICIJ/extract/issues.
result
javac 1.8.0_191
Apache Maven 3.5.4
Maven home: /usr/share/maven
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.18.0-10-generic", arch: "amd64", family: "unix"
tesseract 4.0.0-beta.3-249-g607e
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
/usr/lib/x86_64-linux-gnu/libXtst.so.6
/usr/lib/x86_64-linux-gnu/libXtst.so.6.1.0
https://gorails.com/setup/ubuntu/18.10
ruby --version
curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
cd ..
curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add -
echo "deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list
sudo apt-get update
sudo apt-get install git-core curl zlib1g-dev build-essential libssl-dev libreadline-dev libyaml-dev libsqlite3-dev sqlite3 libxml2-dev libxslt1-dev libcurl4-openssl-dev software-properties-common libffi-dev nodejs yarn
cd
git clone https://github.com/rbenv/rbenv.git ~/.rbenv
echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(rbenv init -)"' >> ~/.bashrc
exec $SHELL
git clone https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build
echo 'export PATH="$HOME/.rbenv/plugins/ruby-build/bin:$PATH"' >> ~/.bashrc
exec $SHELL
The documentation shows no way to log into a solr server secured with user/password authentication.
Can it be done with a command line option?
The build fails in maven-gpg-plugin with the following error even if -Dpgp.skip-true is set.
$ mvn install -Dpgp.skip-true
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.inject.internal.cglib.core.$ReflectUtils$1 (file:/usr/share/maven/lib/guice.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.icij.extract:extract-lib:jar:3.6.1
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-surefire-plugin is missing. @ org.icij.extract:extract:3.6.1, /home/arky/Code/Tika/extract/pom.xml, line 106, column 21
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.icij.extract:extract:pom:3.6.1
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-surefire-plugin is missing. @ line 106, column 21
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO] Inspecting build with total of 3 modules...
[INFO] Installing Nexus Staging features:
[INFO] ... total of 3 executions of maven-deploy-plugin replaced with nexus-staging-maven-plugin
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] ICIJ Extract [pom]
[INFO] extract-lib [jar]
[INFO] extract-cli [jar]
[INFO]
[INFO] ----------------------< org.icij.extract:extract >----------------------
[INFO] Building ICIJ Extract 3.6.1 [1/3]
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] --- maven-gpg-plugin:1.5:sign (sign-artifacts) @ extract ---
gpg: no default secret key: No secret key
gpg: signing failed: No secret key
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for ICIJ Extract 3.6.1:
[INFO]
[INFO] ICIJ Extract ....................................... FAILURE [ 0.132 s]
[INFO] extract-lib ........................................ SKIPPED
[INFO] extract-cli ........................................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.601 s
[INFO] Finished at: 2021-03-29T23:18:19+07:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-gpg-plugin:1.5:sign (sign-artifacts) on project extract: Exit code: 2 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
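The log above shows the sign-artifacts execution still running, which suggests the property given on the command line was not picked up. It is spelled `-Dpgp.skip-true` there, while the build notes earlier on this page use `-Dgpg.skip`, and `gpg.skip` is the user property that maven-gpg-plugin documents for its skip parameter. A minimal sketch of the corrected invocation follows; `build_extract` is a hypothetical wrapper name, not part of the project:

```shell
# Hypothetical helper: run the build with GPG signing skipped.
# Assumes maven-gpg-plugin binds its skip parameter to the gpg.skip property.
# Any extra arguments are passed through to Maven unchanged.
build_extract() {
  mvn install -Dgpg.skip=true "$@"
}
```

Calling `build_extract -e` would run the same build with Maven's `-e` switch added.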
Better to standardise on an existing standard. Expect camel-case property names - that means the Options library will have to convert option-name to optionName when generating a properties object.
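The conversion described above can be sketched as follows. This is a shell illustration only (the Options library itself is Java), and `to_camel` is a hypothetical name chosen for the example:

```shell
# Hypothetical helper: convert a hyphenated option name to camelCase.
# Splits the input on "-" and upper-cases the first letter of every
# segment after the first one.
to_camel() {
  printf '%s\n' "$1" | awk -F- '{
    out = $1
    for (i = 2; i <= NF; i++) {
      out = out toupper(substr($i, 1, 1)) substr($i, 2)
    }
    print out
  }'
}
```

For example, `to_camel file-output-directory` prints `fileOutputDirectory`.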
Ignore files like .DS_Store inside archives. Otherwise an exception like the following is logged:
Sep 13, 2016 4:48:44 PM org.icij.extract.core.ParsingEmbeddedDocumentExtractor parseEmbedded
SEVERE: Unable to parse embedded document in document: Archive.zip.
org.apache.tika.exception.TikaException: Unsupported media type: multipart/appledouble.
at org.icij.extract.core.ErrorParser.parse(ErrorParser.java:55)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
at org.icij.extract.core.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:101)
at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:219)
at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:182)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.icij.extract.core.ParsingReader$ParsingTask.run(ParsingReader.java:267)
at org.icij.extract.core.TextParsingReader$ParsingTask.run(TextParsingReader.java:87)
at java.lang.Thread.run(Thread.java:745)
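One way to read the request above is as a name-based filter applied to archive entries before they are parsed. The sketch below is only an illustration in shell, not the project's implementation: `is_mac_metadata` is a hypothetical helper, and the patterns (`.DS_Store`, `._*` AppleDouble files, `__MACOSX/` folders) are assumptions about what macOS typically adds to archives.

```shell
# Hypothetical helper: succeed if an archive entry path looks like
# macOS metadata rather than a real document.
is_mac_metadata() {
  # Anything inside a __MACOSX folder (assumed pattern).
  case "$1" in
    __MACOSX/*|*/__MACOSX/*) return 0 ;;
  esac
  # Check the file name only: .DS_Store or an AppleDouble "._" file.
  case "${1##*/}" in
    .DS_Store|._*) return 0 ;;
  esac
  return 1
}
```

A filter on the detected media type (multipart/appledouble in the log above) would be an alternative design; the name-based check is shown only because it is easy to follow.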
Maybe that is intended, but:
./target/extract-capsule.x spew -o stdout /usr/share >/dev/null
gives lots of errors like:
Jul 29, 2016 12:13:33 PM org.icij.extract.core.Consumer extract
SEVERE: The document stream could not be read: /usr/share/doc/console-setup.
org.apache.tika.io.TaggedIOException: Is a directory
Furthermore, I find the period at the end of the SEVERE line annoying: it is hard to know whether it is part of the reported name or not.
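The error above says a directory reached the extractor. To see which paths under a tree are regular files, and so readable as document streams, a plain `find` is enough. This is a diagnostic sketch only: `list_regular_files` is a hypothetical helper name, and it does not change what extract itself picks up.

```shell
# Hypothetical helper: print only the regular files below a directory,
# leaving out the directories themselves.
list_regular_files() {
  find "$1" -type f
}
```

Comparing `list_regular_files /usr/share | wc -l` with the number of SEVERE lines gives a rough idea of how many of the errors come from directories.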
Could you please add some instructions on how to run this?
Is your feature request related to a problem? Please describe.
We are currently using Java 8, which is largely deprecated. To benefit from new versions of dependencies, we would like to upgrade to Java 11.
I've discovered two problems relating to file path resolution (which I'd be happy to try to fix with some guidance).
--file-output-directory with spew only works with the first folder specified in a path. So specifying the output as C:/folder actually only sends something to C:/.
[...] C:/) causes an InvalidPathException (altering the path so that it uses a server address, e.g., //myfs/, causes problems as well).
Hi - not succeeding in building, after following Wiki build instructions:
After 'mvn install' on command line...
error on Mac:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 07:00 min
[INFO] Finished at: 2017-08-13T12:55:02+02:00
[INFO] Final Memory: 21M/137M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project extract: Could not resolve dependencies for project org.icij.extract:extract:jar:2.0.0: The following artifacts could not be resolved: org.icij.kaxxa:kaxxa-concurrent:jar:1.0-SNAPSHOT, org.icij.kaxxa:kaxxa-events:jar:1.0-SNAPSHOT, org.icij.kaxxa:kaxxa-io:jar:1.0-SNAPSHOT, org.icij.kaxxa:kaxxa-sql:jar:1.0-SNAPSHOT: Could not find artifact org.icij.kaxxa:kaxxa-concurrent:jar:1.0-SNAPSHOT in jitpack.io (https://jitpack.io) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
REPRODUCING SAME ERROR (on AWS Ubuntu 15.6.0):
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 07:00 min
[INFO] Finished at: 2017-08-13T12:55:02+02:00
[INFO] Final Memory: 21M/137M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project extract: Could not resolve dependencies for project org.icij.extract:extract:jar:2.0.0: The following artifacts could not be resolved: org.icij.kaxxa:kaxxa-concurrent:jar:1.0-SNAPSHOT, org.icij.kaxxa:kaxxa-events:jar:1.0-SNAPSHOT, org.icij.kaxxa:kaxxa-io:jar:1.0-SNAPSHOT, org.icij.kaxxa:kaxxa-sql:jar:1.0-SNAPSHOT: Could not find artifact org.icij.kaxxa:kaxxa-concurrent:jar:1.0-SNAPSHOT in jitpack.io (https://jitpack.io) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
When trying to compile/install extract, the following error is obtained:
$ mvn install -X -DskipTests
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /home/user/extract/src/main/java/org/icij/extract/cli/Main.java:[119,48] **method invoked with incorrect number of arguments; expected 1, found 0**
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 14.640 s
[INFO] Finished at: 2017-10-30T04:52:02+01:00
[INFO] Final Memory: 39M/101M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile (default-compile) on project extract: Compilation failure
[ERROR] /home/user/extract/src/main/java/org/icij/extract/cli/Main.java:[119,48] method invoked with incorrect number of arguments; expected 1, found 0
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile (default-compile) on project extract: Compilation failure
/home/user/extract/src/main/java/org/icij/extract/cli/Main.java:[119,48] **method invoked with incorrect number of arguments; expected 1, found 0**
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
at sun.reflect.NativeMethodAccessorImpl.invoke0(java.base@9-internal/Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(java.base@9-internal/NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(java.base@9-internal/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@9-internal/Method.java:531)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.compiler.CompilationFailureException: Compilation failure
/home/user/extract/src/main/java/org/icij/extract/cli/Main.java:[119,48] method invoked with incorrect number of arguments; expected 1, found 0
at org.apache.maven.plugin.compiler.AbstractCompilerMojo.execute(AbstractCompilerMojo.java:1029)
at org.apache.maven.plugin.compiler.CompilerMojo.execute(CompilerMojo.java:137)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
... 20 more
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException