linkedin / url-detector
A Java library to detect and normalize URLs in text
Hey folks, this library is super useful. There is only one release available on Maven Central: https://search.maven.org/artifact/com.linkedin.urls/url-detector/0.1.17/jar
Any chance this could be updated?
Detecting the following url
www.foo1111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.com
The regex you are using detects a bare number as a valid URL. I tested it with the text: this is a test 1
and it detected the number 1 as a valid URL.
Error:PARSE ERROR:
Error:unsupported class file version 52.0
Error:...while parsing com/linkedin/urls/HostNormalizer.class
Error:1 error; aborting
Error:Execution failed for task ':app:transformClassesWithDexForDebug'.
com.android.build.api.transform.TransformException: com.android.ide.common.process.ProcessException: java.util.concurrent.ExecutionException: java.lang.UnsupportedOperationException
I see that this is fixed in a8dec18 (not in 0.1.17 release on maven central), just creating the issue here to document.
Executing the following code:
Url.create("http://013.xxx/");
fails with the following error:
java.net.MalformedURLException: We couldn't find any urls in string: http://013.xxx/
at com.linkedin.urls.Url.create(Url.java:69)
It looks as if the utility treats the xxx part as an invalid IP address instead of a valid suffix.
Expected result:
The Url should be created, and the host should be 013.xxx
While #6 and #2 are still being resolved, you can use JitPack as a temporary workaround.
Add:
...
<repositories>
  <repository>
    <id>jitpack.io</id>
    <url>https://jitpack.io</url>
  </repository>
</repositories>
....
<dependency>
  <groupId>com.github.linkedin</groupId>
  <artifactId>URL-Detector</artifactId>
  <version>2a0fede05e</version>
</dependency>
to your pom.xml.
Would it be possible to upload your latest version to MavenCentral? In particular, we would like to take advantage of ae214b7
Our project, which includes URLDetector, has JUnit 4 unit tests that seem to run into problems because of TestNG.
Thanks
It might help get visibility, as UIMA has a lot of other nice plugins.
This string (excluding the double quotes) triggers a StringIndexOutOfBoundsException:
"://VIVE MARINE LE PEN//:@."
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:908) ~[na:1.8.0_60]
at java.lang.StringBuilder.substring(StringBuilder.java:76) ~[na:1.8.0_60]
at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:854) ~[na:1.8.0_60]
at java.lang.StringBuilder.substring(StringBuilder.java:76) ~[na:1.8.0_60]
at com.linkedin.urls.detection.UrlDetector.readDefault(UrlDetector.java:191) ~[url-detector-0.1.17.jar!/:na]
at com.linkedin.urls.detection.UrlDetector.detect(UrlDetector.java:142) ~[url-detector-0.1.17.jar!/:na]
The Apache dependency is not declared in build.gradle. Is that intentional?
Thanks for a very useful library.
I note that the list of valid schemes is fairly small, which means that a URL with a file: scheme is not parsed correctly and comes back with http as the default scheme. Could you add file: to the list of valid schemes, or perhaps create an option that returns anything that looks like a scheme, together with something like boolean isKnownSchema()?
Cheers.
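To make the request above concrete, here is a minimal sketch of what such an option could look like. It is not the library's code: the class SchemeProbe, both method names, and the contents of the KNOWN set are assumptions made up for illustration, and the scheme grammar is taken from RFC 3986.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class SchemeProbe {
    // RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":".
    private static final Pattern SCHEME = Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*):");

    // Hypothetical allow-list; the library's real list of schemes may differ.
    private static final Set<String> KNOWN = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("http", "https", "ftp", "file")));

    private SchemeProbe() {}

    /** Returns the lower-cased scheme if the input starts with something scheme-shaped. */
    public static Optional<String> schemeOf(String input) {
        Matcher m = SCHEME.matcher(input);
        return m.find()
                ? Optional.of(m.group(1).toLowerCase(Locale.ROOT))
                : Optional.empty();
    }

    /** True only when a scheme is present and it is on the allow-list. */
    public static boolean isKnownScheme(String input) {
        return schemeOf(input).map(KNOWN::contains).orElse(false);
    }
}
```

Note that a bare host:port such as example.com:8080 also matches the scheme grammar, so a real implementation would need an extra check before trusting the result.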
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -2
at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:908)
at java.lang.StringBuilder.substring(StringBuilder.java:76)
at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:854)
at java.lang.StringBuilder.substring(StringBuilder.java:76)
at com.linkedin.urls.detection.UrlDetector.readDefault(UrlDetector.java:191)
at com.linkedin.urls.detection.UrlDetector.detect(UrlDetector.java:142)
at com.mycompany.url.UrlTest.main(UrlTest.java:26)
Thanks for the logic! Any chance you're planning on deploying to Maven central?
If you run the detector in the text below, it thinks the whole text is a URL.
我进入你的主页很卡顿,也许是你的关注人数或者其他数据太多了,其他人主页没有这么卡顿。来自amethyst客户端
Characters 。 and , are single characters and are not considered spaces in this library.
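As a possible workaround until the library handles these characters, the text could be pre-processed so that CJK clause punctuation becomes an ASCII space before detection. The sketch below is an assumption, not part of the library: the class name, the method name, and the choice of characters are all hypothetical. It replaces one character with one character, so offsets into the original text stay valid.

```java
public final class CjkPunctuationSpacer {
    // Ideographic full stop, ideographic comma, fullwidth comma, semicolon,
    // colon, exclamation mark, question mark, and ideographic space.
    private static final String DELIMITERS =
            "\u3002\u3001\uFF0C\uFF1B\uFF1A\uFF01\uFF1F\u3000";

    private CjkPunctuationSpacer() {}

    /** Returns a copy of text with each listed delimiter replaced by a single space. */
    public static String toSpaces(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(DELIMITERS.indexOf(c) >= 0 ? ' ' : c);
        }
        return out.toString();
    }
}
```

The result of toSpaces(text) would then be passed to the detector in place of the raw text. Whether this fully avoids the false positive has not been verified here.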
When parsing a URL like "linkedin.com", the Url object fills in a default scheme of 'http' if none is detected, and getScheme() returns that default.
I can understand why some defaults were included but it would be nice if this behavior could be configured. I need to know whether the original input text contained the scheme.
I can always do something like url.getOriginalUrl().startsWith(url.getScheme()) but I don't want to have to do that everywhere.
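The workaround mentioned above could be wrapped in one small helper so it does not have to be repeated at every call site. This is only a sketch: the class ExplicitScheme and its method are hypothetical names, not library API. Compared with a plain startsWith, it ignores case and requires the colon after the scheme, so an input like "httpbin.org" is not mistaken for one that carries an explicit "http" scheme.

```java
public final class ExplicitScheme {
    private ExplicitScheme() {}

    /**
     * Returns true when the original input text begins with the given scheme
     * followed by ':' (case-insensitive), i.e. the scheme was typed by the user
     * rather than filled in as a default.
     */
    public static boolean wasInInput(String original, String scheme) {
        if (original == null || scheme == null || scheme.isEmpty()) {
            return false;
        }
        String trimmed = original.trim();
        int n = scheme.length();
        return trimmed.length() > n
                && trimmed.regionMatches(true, 0, scheme, 0, n)
                && trimmed.charAt(n) == ':';
    }
}
```

A call site would then read ExplicitScheme.wasInInput(url.getOriginalUrl(), url.getScheme()), using the two accessors mentioned above.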
In the interim, while issue #2 is being worked on, it would be ideal if it were possible to install the url-detector library in the local Maven repository (typically ~/.m2/repository/), so that other Maven-based build tools can consume the library.
Note that issue #2 is a vastly preferable solution to this problem, but allowing local installation (this issue) provides a short-term workaround.
String text = ".............:::::::::::;;;;;;;;;;;;;;;::...............................................:::::::::::::::::::::::::::::....................";
UrlDetector d = new UrlDetector(text, UrlDetectorOptions.Default);
d.detect();
Running this will throw
Exception in thread "main" java.lang.NegativeArraySizeException: Backtracked max amount of characters. Endless loop detected. Bad Text: ':...............................................:::::::::::::::::::::::::::::....................'
at com.linkedin.urls.detection.InputTextReader.checkBacktrackLoop(InputTextReader.java:144)
at com.linkedin.urls.detection.InputTextReader.seek(InputTextReader.java:120)
at com.linkedin.urls.detection.UrlDetector.readUserPass(UrlDetector.java:511)
at com.linkedin.urls.detection.UrlDetector.readScheme(UrlDetector.java:458)
at com.linkedin.urls.detection.UrlDetector.processColon(UrlDetector.java:293)
at com.linkedin.urls.detection.UrlDetector.readDefault(UrlDetector.java:253)
at com.linkedin.urls.detection.UrlDetector.detect(UrlDetector.java:142)
at Main.main(Main.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
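Until crashes like this one and the StringIndexOutOfBoundsException reports are fixed, callers may want to guard the detection call. The following is a generic sketch, assuming it is acceptable to treat a crashing input as "no URLs found"; SafeDetect and orEmpty are hypothetical names, and the wrapper takes a Supplier so it does not depend on any library type.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public final class SafeDetect {
    private SafeDetect() {}

    /**
     * Runs the supplied detection and returns its result, or an empty list if it
     * throws an unchecked exception (both StringIndexOutOfBoundsException and
     * NegativeArraySizeException are unchecked) or returns null.
     */
    public static <T> List<T> orEmpty(Supplier<List<T>> detection) {
        try {
            List<T> found = detection.get();
            return found != null ? found : Collections.<T>emptyList();
        } catch (RuntimeException e) {
            // A real application would probably log the exception and the offending text here.
            return Collections.emptyList();
        }
    }
}
```

With the snippet from this report, the call would look roughly like SafeDetect.orEmpty(() -> d.detect()), assuming detect() returns a List.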
If the text contains 10.00hr, it is considered a URL:
runTest("10.00hr,", UrlDetectorOptions.Default);
It should return an empty result, but the result is [http://10.00hr]
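One way to filter out false positives such as 10.00hr, or the bare number 1, after detection is to check the last label of the detected host. The sketch below is a heuristic and not the library's logic; the class and method names are hypothetical. It relies on the assumption that delegated top-level domains are either purely alphabetic or punycode labels starting with xn--. It deliberately rejects IPv4 literals and single-label hosts such as localhost, so it only suits callers that want domain-name URLs.

```java
import java.util.Locale;

public final class HostPlausibility {
    private HostPlausibility() {}

    /** Returns true if the host has at least two labels and a TLD-shaped last label. */
    public static boolean hasPlausibleTld(String host) {
        if (host == null || host.isEmpty()) {
            return false;
        }
        // Tolerate a single trailing dot (fully qualified form).
        String h = host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
        int dot = h.lastIndexOf('.');
        if (dot < 0 || dot == h.length() - 1) {
            return false; // need at least "label.tld"
        }
        String tld = h.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (tld.startsWith("xn--")) {
            return true; // punycode-encoded internationalized TLD
        }
        for (int i = 0; i < tld.length(); i++) {
            char c = tld.charAt(i);
            if (c < 'a' || c > 'z') {
                return false; // digits or other characters, e.g. "00hr"
            }
        }
        return tld.length() >= 2;
    }
}
```

Under this check, 10.00hr and 1 are rejected, while linkedin.com and 013.xxx are accepted.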
There are several issues (including fixes in pull requests) that have gone unaddressed for a long time. Could this project be handed over to other maintainers? @tzuhanjan can you comment?