Comments (14)
Another problem occurs, but it is probably related to Java bugs:
2017-10-02 15:30:27,602 19221 WARN [ParsingThread-15] i.u.d.l.b.f.ParsingThread - Exception while fetching https://www.teamschramm.com/robots.txt
javax.net.ssl.SSLException: Received fatal alert: internal_error
at sun.security.ssl.Alerts.getSSLException(Alerts.java:208)
at sun.security.ssl.Alerts.getSSLException(Alerts.java:154)
at sun.security.ssl.SSLSocketImpl.recvAlert(SSLSocketImpl.java:2033)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1135)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1385)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1413)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1397)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396)
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
at org.apache.http.impl.conn.BasicHttpClientConnectionManager.connect(BasicHttpClientConnectionManager.java:323)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:221)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:191)
at it.unimi.di.law.bubing.util.FetchData.fetch(FetchData.java:325)
at it.unimi.di.law.bubing.frontier.FetchingThread.run(FetchingThread.java:272)
from bubing.
Hmm. HTTPS continually has this kind of problem: the library is set to very tight security, while browsers are, as always, forgiving. I guess there's an HTTPClient parameter to disable this check...
from bubing.
So, after digging around, it may actually be related to the Java version and Let's Encrypt certificates: https://stackoverflow.com/questions/34110426/does-java-support-lets-encrypt-certificates
Except that the affected sites do not seem to be using these certificates. Anyway, I think it would be a good idea to find a setting that disables SSL verification.
from bubing.
More tips here :
protected static final class BasicHttpClientConnectionManagerWithAlternateDNS extends BasicHttpClientConnectionManager {
    static Registry<ConnectionSocketFactory> getDefaultRegistry() {
        return RegistryBuilder.<ConnectionSocketFactory> create()
            .register("http", PlainConnectionSocketFactory.getSocketFactory())
            .register("https", new SSLConnectionSocketFactory(SSLContexts.createSystemDefault(),
                new String[] { "TLSv1.2", "TLSv1.1", "TLSv1", "SSLv3", "SSLv2Hello", },
                null, SSLConnectionSocketFactory.getDefaultHostnameVerifier()))
            .build();
    }

    public BasicHttpClientConnectionManagerWithAlternateDNS(final DnsResolver dnsResolver) {
        super(getDefaultRegistry(), null, null, dnsResolver);
    }
}
I think the problem is with getDefaultHostnameVerifier: it is probably the strict version:
https://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/conn/ssl/
We should probably use NoopHostnameVerifier, or at least BrowserCompatHostnameVerifier.
from bubing.
Here is some code that seems to work (in FetchingThread):
/** An SSL context that accepts all certificates */
private static final SSLContext TRUST_ALL_CERTIFICATES_SSL_CONTEXT;

static {
    try {
        TRUST_ALL_CERTIFICATES_SSL_CONTEXT = SSLContexts.custom().loadTrustMaterial(null, new TrustStrategy() {
            public boolean isTrusted(X509Certificate[] arg0, String arg1) throws CertificateException {
                return true;
            }
        }).build();
    }
    catch (Exception cantHappen) {
        throw new RuntimeException(cantHappen.getMessage(), cantHappen);
    }
}

/** A support class that makes it possible to plug in a custom DNS resolver. */
protected static final class BasicHttpClientConnectionManagerWithAlternateDNS
        extends BasicHttpClientConnectionManager {

    static Registry<ConnectionSocketFactory> getDefaultRegistry() {
        // Set up a trust strategy that allows all certificates.
        SSLContext sslContext = TRUST_ALL_CERTIFICATES_SSL_CONTEXT;
        return RegistryBuilder.<ConnectionSocketFactory> create()
            .register("http", PlainConnectionSocketFactory.getSocketFactory())
            .register("https",
                new SSLConnectionSocketFactory(sslContext,
                    new String[] {
                        "TLSv1.2",
                        "TLSv1.1",
                        "TLSv1",
                        "SSLv3",
                        "SSLv2Hello",
                    }, null, new NoopHostnameVerifier()))
            .build();
    }

    public BasicHttpClientConnectionManagerWithAlternateDNS(final DnsResolver dnsResolver) {
        super(getDefaultRegistry(), null, null, dnsResolver);
    }
}
from bubing.
OK. I think we should add a parameter here, since there are possible security risks involved. But I agree: the more compatible we are, the better.
from bubing.
Agreed, however, allowing for self-signed certificates is in itself sufficient for totally blowing up SSL security, unless you have pinned the certificates beforehand.
from bubing.
I see. Did you check whether
https://www.superprof.fr/cours/toute-matiere/marseille/
does work when addressed directly (i.e., not through redirects)?
from bubing.
The problem is that inherited methods are logged under the base class. Try this:
public class Test {
    public static class A {
        void a() {
            throw new RuntimeException();
        }
    }

    public static class B extends A {}

    public static void main(String[] args) {
        // The stack trace reports Test$A.a, although a() is invoked on a B instance.
        new B().a();
    }
}
from bubing.
Hi, yes, I checked; that's why I removed my comment. The problem is not that the ConnectionManager is ignored, just that this particular SSL connection fails, whether it comes from a redirect or not. It's probably a Java bug; I'll have to dig deeper to fix this.
from bubing.
So, after a bit more digging, it seems that, for some reason, the SSL layer tries to initiate an SSLv2 handshake with some sites, and they reply harshly. Removing SSLv2 (the "SSLv2Hello" entry) from the list of supported protocols in the ConnectionManager constructor alleviates the problem (some sites still fail, though). However, I guess this protocol was added to support some sites?
Consequently, the only way to deal with the problem, in my opinion, would be to have several connection managers with different supported protocols, catching SSL exceptions and retrying with another connection manager when they happen.
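The retry idea could look roughly like the sketch below. It uses only JDK types, so it is not BUbiNG's or HttpClient's actual code: the names ProtocolFallback, Attempt and withFallback are hypothetical, and a real attempt would fetch through a connection manager built for the given protocol list. As far as I know, JSSE's "SSLv2Hello" does not enable SSLv2 itself; it only wraps the initial hello in the SSLv2 format, which some servers reject, so the sketch tries the list without it first.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLException;

/** Hypothetical helper: retries a fetch with successive protocol lists. */
final class ProtocolFallback {
    /** One fetch attempt restricted to the given enabled protocols. */
    interface Attempt<T> {
        T run(String[] protocols) throws IOException;
    }

    /** Assumed ordering: first without SSLv2Hello, then the full list from the thread. */
    static final List<String[]> PROTOCOL_SETS = Arrays.asList(
        new String[] { "TLSv1.2", "TLSv1.1", "TLSv1", "SSLv3" },
        new String[] { "TLSv1.2", "TLSv1.1", "TLSv1", "SSLv3", "SSLv2Hello" });

    /** Runs the attempt with each protocol list in turn; only SSL failures trigger a retry. */
    static <T> T withFallback(List<String[]> protocolSets, Attempt<T> attempt) throws IOException {
        if (protocolSets.isEmpty()) throw new IllegalArgumentException("No protocol sets given");
        SSLException last = null;
        for (String[] protocols : protocolSets) {
            try {
                return attempt.run(protocols);
            } catch (SSLException e) {
                last = e; // Remember the failure and try the next protocol list.
            }
        }
        throw last; // Every protocol list failed with an SSL error.
    }

    public static void main(String[] args) throws IOException {
        // Simulated server that only answers when the SSLv2-format hello is offered.
        String[] used = withFallback(PROTOCOL_SETS, protocols -> {
            if (!Arrays.asList(protocols).contains("SSLv2Hello")) {
                throw new SSLException("Received fatal alert: internal_error");
            }
            return protocols;
        });
        System.out.println("Succeeded with " + Arrays.toString(used));
    }
}
```

Other I/O errors (timeouts, refused connections) are deliberately not retried here, since a different protocol list would not help with them.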
from bubing.
Well, I think I copied that list of protocols from somewhere, to include all protocols. We might make that list an optional parameter, too. But how many sites would be affected by this?
from bubing.
There may be a hint in this blog post: https://jve.linuxwall.info/blog/index.php?post/TLS_Survey
It's not recent, but it's probably still interesting.
from bubing.
Default is now to accept all certificates; an option brings back the previous behaviour.
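For readers wondering what such a switch amounts to, here is a minimal JDK-only sketch. It is not the code that was committed: the class SslContextChooser and the flag acceptAllCertificates are hypothetical names, and the actual option name in BUbiNG's configuration is not given in this thread.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

/** Hypothetical helper: picks an SSL context according to a boolean option. */
final class SslContextChooser {
    /** A trust manager that performs no checks at all; unsafe outside of crawling. */
    private static final TrustManager TRUST_ALL = new X509TrustManager() {
        @Override public void checkClientTrusted(X509Certificate[] chain, String authType) { }
        @Override public void checkServerTrusted(X509Certificate[] chain, String authType) { }
        @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
    };

    /** Returns a trust-all context when the flag is set, else the JVM default context. */
    static SSLContext choose(boolean acceptAllCertificates) throws GeneralSecurityException {
        if (!acceptAllCertificates) return SSLContext.getDefault();
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, new TrustManager[] { TRUST_ALL }, null);
        return context;
    }

    public static void main(String[] args) throws GeneralSecurityException {
        System.out.println(choose(true).getProtocol());
    }
}
```

The resulting context would then be passed to the socket factory in place of the system default; note that hostname verification is a separate check and is not affected by this sketch.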
from bubing.