lapplislazuli / hopinosis

Opinosis Implementation in Haskell

License: MIT License

Haskell 100.00%

nlp nlp-library opinosis opinosis-summarizer text-summarization

hopinosis's Introduction

Hi There! I'm Leonhard. 👋

Welcome to my Github Profile.

I am currently doing my PhD at the TU Delft in the Software Engineering Research Group (SERG) where I work in the CISELab. My research track is labelled "AI Testing and Testing AI" - So I do twice the AI and twice the Testing that other researchers do. I am very much the Thermomix of Software Engineering AI. But recently, I have spent most of my work-time with Matthi on improving tooling for Haskell!

I like:

☕ Java & 💜 Haskell
🔍 Testing
👥 Humans

I try to do everything open source, so most of the repositories you see here have .

If you want to reach me, you can either

📧 email me at [email protected]
❗ Open an issue anywhere around and @ me

hopinosis's Issues

Return Graph as GraphSON

GraphSON is useful for visualizing the graph with other tools, and it increases interoperability with other libraries. It would be good to return the graph itself, rather than "stopping" at the finished summary.

Proposed Solution
Add a function toGraphSON :: Graph -> GraphSON and toGraphSONstring :: Graph -> String.

Add Interfacing Methods in Hopinosis.hs which help with using these, such as
parseToJson :: String -> GraphSON and String -> String.

Possible Alternatives:
Maybe the Text Datatype should be used.

Maybe only the helper-functions should be exposed, as the interface functions are rather redundant and any haskeller with a short readme should be able to use it.

Possible Problems:
The GraphSON output should be produced via a library. However, Aeson (which is a good library) may increase the build times exorbitantly. The CI is already at 7 minutes. :(

Additional Context:
This is something for v2.

GraphSON Spec

Cabal GenBounds

As requested by Hackage, run cabal gen-bounds to generate version bounds for the dependencies.

QuickCheck-Tests

QuickCheck can be used to check some of the properties of my graph.

Proposed Solution
Write some quickcheck-tests for

graph being monoid
graphs being distinct after parsing (no nodes duplicated)
values being monoid

run them with the normal cabal build.
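
The properties listed above could first be written down as plain predicates. This is a minimal sketch, not the library's test suite: parseNodes is a hypothetical stand-in for the real graph parser, and the Map of counts is an assumed node representation. QuickCheck's quickCheck accepts Bool-returning functions like these once the argument types have Arbitrary instances and the polymorphic laws are fixed to a concrete type.

```haskell
import Data.List (nub)
import qualified Data.Map.Strict as Map

-- Monoid laws as plain predicates, usable for any type with Eq and Monoid.
prop_leftIdentity :: (Eq a, Monoid a) => a -> Bool
prop_leftIdentity x = mempty <> x == x

prop_rightIdentity :: (Eq a, Monoid a) => a -> Bool
prop_rightIdentity x = x <> mempty == x

prop_associative :: (Eq a, Semigroup a) => a -> a -> a -> Bool
prop_associative x y z = (x <> y) <> z == x <> (y <> z)

-- Hypothetical stand-in for graph parsing: each word becomes a node keyed by
-- its text, with an occurrence count as payload.
parseNodes :: String -> Map.Map String Int
parseNodes = Map.fromListWith (+) . map (\w -> (w, 1)) . words

-- After parsing there is exactly one node per distinct word.
prop_distinctNodes :: String -> Bool
prop_distinctNodes s = Map.size (parseNodes s) == length (nub (words s))
```

The same predicates can be called by hand in GHCi, which fits the idea of reusing generated values for manual testing.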

Possible Problems:
There may be problems with running both QuickCheck and HUnit.

Creating Arbitrary instances of the types may turn out to be difficult.

Additional Context:
The arbitrary instances can be perfectly used for manual testing as well.

This means there is huge value in just being able to generate random graphs, nodes and paths.

use "common" in Cabal

The cabal file has a lot of duplication - which would be better removed.

Proposed Solution
Use a common section in the .cabal file such as in the cabal documentation

Possible Problems:
It might hurt, and then I will let it be.

Related Issues:
This will also reduce the work for #18

Add Github Action Badge

Let's show the world that I can use Github actions!

Sentence-Similarity

To pick most redundant, but also distinct sentences, somehow I need to compare every sentence to every already chosen sentence.

Proposed Solution
There should be at least one function which gets the distance between two paths/sentences.

Then there should be a function which somewhat weights the metric-score with the distance to already chosen sentences.
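
One possible shape for such a weighting is a greedy selection that scales each candidate's score by its distance to the closest sentence already chosen. This is only a sketch under assumptions: the Path, Metric and Distance types are hypothetical, the distance is assumed to lie between 0 (identical) and 1 (nothing in common), and multiplying score and distance is one choice among many.

```haskell
import Data.List (delete, maximumBy)
import Data.Ord (comparing)

-- Hypothetical types; the library's own definitions may differ.
type Path = [String]
type Metric = Path -> Double
type Distance = Path -> Path -> Double

-- Score of a candidate, scaled by its distance to the closest chosen path.
-- With nothing chosen yet, the raw metric score is used.
adjusted :: Metric -> Distance -> [Path] -> Path -> Double
adjusted metric dist chosen p =
  metric p * minimum (1 : map (dist p) chosen)

-- Greedily pick up to n paths, re-weighting the remaining candidates after
-- every pick.
pickDistinct :: Metric -> Distance -> Int -> [Path] -> [Path]
pickDistinct metric dist n = go n []
  where
    go k chosen candidates
      | k <= 0 || null candidates = reverse chosen
      | otherwise =
          let best = maximumBy (comparing (adjusted metric dist chosen)) candidates
          in go (k - 1) (best : chosen) (delete best candidates)
```

Because the weighting is an ordinary function argument, other combinations (for example subtracting a penalty instead of multiplying) can be tried without touching the selection loop.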

Possible Problems:
Maybe it's hard to make a nice, functional solution for it.

Related Issues:
This is a subtask for #11

Additional Context:
There are many ways to compare sentence similarity. One Example Article

Add Github Action

CI is always nice.

Cabal new-test is running, so CI is easy.

Either look into the new GitHub Actions (they are cool) or just copy and paste the travis.yml from Chesskell.

Add Github Templates

For Issues, Commits and PRs

Use Data.Map as outs

Currently the "outs" is a list of tuples (String,Int) which is a good case for a Data.Map

http://hackage.haskell.org/package/containers-0.6.2.1/docs/Data-Map-Strict.html
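
A minimal sketch of what the switch could look like, assuming "outs" maps each following word to the number of times it was seen. The names Outs, addOut, mergeOuts and fromFollowers are hypothetical; insertWith, unionWith and empty are regular Data.Map.Strict functions.

```haskell
import qualified Data.Map.Strict as Map

-- Assumed shape of the successor table: following word -> times seen.
type Outs = Map.Map String Int

-- Record one more transition to the given word.
addOut :: String -> Outs -> Outs
addOut word = Map.insertWith (+) word 1

-- Merge the successor tables of two occurrences of the same node.
mergeOuts :: Outs -> Outs -> Outs
mergeOuts = Map.unionWith (+)

-- Build the table from all words observed after a node.
fromFollowers :: [String] -> Outs
fromFollowers = foldr addOut Map.empty
```

Compared to the list of tuples, the "insert or increment" case collapses into a single insertWith call.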

More Metrics

The Metrics seem to have the biggest impact on the results. Some more would be great.

Proposed Solution

Look for more possible Metrics
Make easy functions to combine Metrics, such as

metricAccumulator :: Metric -> Metric -> (Double -> Double -> Double) -> Metric
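
The signature above could be implemented roughly as follows. This is a sketch that assumes a Metric is a function from a path to a Double, which may not match the library's actual type; lengthMetric and weightedSum are made-up helpers for illustration.

```haskell
-- Assumed types; the library's own Metric may carry more information.
type Path = [String]
type Metric = Path -> Double

-- Run both metrics on the same path and combine their scores.
metricAccumulator :: Metric -> Metric -> (Double -> Double -> Double) -> Metric
metricAccumulator m1 m2 combine path = combine (m1 path) (m2 path)

-- Hypothetical example metric: the number of words in the path.
lengthMetric :: Metric
lengthMetric = fromIntegral . length

-- Hypothetical helper: weight the first metric with w, the second with 1 - w.
weightedSum :: Double -> Metric -> Metric -> Metric
weightedSum w m1 m2 = metricAccumulator m1 m2 (\a b -> w * a + (1 - w) * b)
```

Since the result is again a Metric, accumulators can be nested to combine more than two metrics.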

Possible Problems:
Metrics could be slow or otherwise flawed.

Additional Context:
Metrics should be reasonably fast.

NFData Instance for Node

The Node data type should be forceable, so that the complete Graph can be fully evaluated.

Proposed Solution
Implement the NFData Instance for Node.

Possible Alternatives:
Maybe NFData can be derived and does not need to be written by hand.
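
A hand-written instance is short. The sketch below assumes a Node with three fields; the field names and the Graph synonym are hypothetical and only illustrate the pattern. NFData, rnf and force come from Control.DeepSeq in the deepseq package.

```haskell
import Control.DeepSeq (NFData (..), force)
import qualified Data.Map.Strict as Map

-- Hypothetical node layout; the real Node may have other fields.
data Node = Node
  { occurrences :: Int
  , successors :: Map.Map String Int
  , sentenceStarts :: Int
  } deriving (Eq, Show)

-- Fully evaluate every field of a node.
instance NFData Node where
  rnf (Node o s b) = rnf o `seq` rnf s `seq` rnf b

-- Assumed graph shape: word -> node.
type Graph = Map.Map String Node

-- Returns the same graph; once the result is demanded, all nodes are
-- evaluated completely.
forceGraph :: Graph -> Graph
forceGraph = force
```

On the derivation question: deepseq also provides a Generic-based default for rnf, so deriving Generic for Node and declaring an empty NFData instance should work as well.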

Related Issues:
This is another step towards #15, which enables me to run all the graph parsing beforehand and therefore get better numbers to work with.

Additional Context:
NFData Source

Coverage

Code Coverage for a library is very important.

Proposed Solution
Find a way to enforce a certain code coverage in the build process.

In addition, it would be nice to display the coverage in the Readme.

Possible Alternatives:
Do not enforce the code coverage, just look at it from time to time

Possible Problems:
Maybe there are no strong tools for code coverage in Haskell.

Additional Context:
The HPC site how to run it in cabal

RoadMap

Roadmap until v1

Use Data.Map as graph

For the Graph I'm using a list of tuples (String,Values), which is a perfect case for Data.Map.

http://hackage.haskell.org/package/containers-0.6.2.1/docs/Data-Map-Strict.html

There is also an "update" function, which may help with the "AddIfNotAdded".

Make Values Monoid

The Values datatype is a perfect fit for being a monoid.

It consists only of datatypes that are already monoids.

The mappend is already "mergeValues" and the mempty is emptyValues.

All monoid laws should be fulfilled, as the tests show.
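
A sketch of how the instances could look. The fields of Values are assumptions made for illustration; only the names mergeValues and emptyValues come from the existing code. Since GHC 8.4, Monoid has Semigroup as a superclass, so the merge function goes into the Semigroup instance.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical field layout; the real Values may differ.
data Values = Values
  { magnitude :: Int
  , outs :: Map.Map String Int
  , starts :: Int
  } deriving (Eq, Show)

-- Field-wise merge: counters are added, successor tables are united.
mergeValues :: Values -> Values -> Values
mergeValues (Values m1 o1 s1) (Values m2 o2 s2) =
  Values (m1 + m2) (Map.unionWith (+) o1 o2) (s1 + s2)

-- Neutral element of mergeValues.
emptyValues :: Values
emptyValues = Values 0 Map.empty 0

instance Semigroup Values where
  (<>) = mergeValues

instance Monoid Values where
  mempty = emptyValues
```

With these instances in place, mconcat merges a whole list of Values in one call.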

Add Haddock

Add Haddock documentation to the exposed functions of each module.

Proposed Solution
See above

Possible Alternatives:

Haddockify everything
Haddockify nothing

Possible Problems:
Maybe I will still remove functions, and then the documentation will be wasted.

At least have it in place and everything running. In case I want to write more or less documentation, it's still nice to have it in the git history and have it done once.

Additional Context:
Haddock readme: https://haskell-haddock.readthedocs.io/en/latest/index.html

Integration-Tests don't find symlink

The CI pipeline tries to invoke the Hopinosis program after installation, but the command is unknown.

To Reproduce

See the failing action run

To reproduce, one can rerun the action.

Expected behavior

The symlink should be used and the command should be (successfully) run.

Instead the symlink is not found.

Possible Workaround

Maybe the binary can be run from the file-path.
However, the path of the cabal build should contain a version number as well as some other parameters, which makes this rather hard to do.

Additional context

The initial python file ran perfectly fine on my machine.
Apparently the only missing piece is to run the application properly.

Banging Graph does not do anything

The time taken for several graphs is the same.
This is an indicator that something is odd.

To Reproduce

Run:

Hopinosis ./Files/darkwing.txt 2 0.51 0.51

and

Hopinosis ./Files/coon.txt 2 0.51 0.51

They both take 0.00006 seconds, which is an oddly short number and more importantly they should differ.

Expected behavior

Building the graph should

a) Take longer
b) Take a different amount of time for each file

Related Issues

This is another step towards #15 or rather the fix of it.

Additional context

I have tried to have the NFData for graph in place, which seems to be working.

I have taken it from a different library in that form, so it is very much the same as the one for tuples at the moment.

Split Metric into "Selector" and "Metric"

The current Metric module contains a lot of things that are not about metrics but about selection.
These things should be separated.

Proposed Solution
Split Metric.hs into Metrics.hs and Selection.hs

Possible Alternatives:

Keep as is
Split Selection into "Selection" and "Selection internals" just like Paths

Possible Problems:
None, it's just effort, and maybe it doesn't make more sense after the split

Related Issues:
None.

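Jaccard Distance

Add the Jaccard distance.

Proposed Solution
Add a function jaqqard :: DistanceFunction to Metrics

The Jaccard index is the number of words two sentences share, divided by the number of distinct words in both sentences together. The Jaccard distance is one minus that index.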
The function "toVectors" should be usable.
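
A set-based sketch of the distance, which is one minus the share of common words among all distinct words of both sentences. It assumes that a DistanceFunction takes two word lists and returns a Double, which may not match the library's definition, and it uses Data.Set directly instead of "toVectors". jaccardIndex is a hypothetical helper name.

```haskell
import qualified Data.Set as Set

-- Assumed type; the library's DistanceFunction may look different.
type DistanceFunction = [String] -> [String] -> Double

-- Shared words divided by all distinct words of both sentences.
-- Two empty sentences are treated as identical.
jaccardIndex :: [String] -> [String] -> Double
jaccardIndex xs ys
  | Set.null united = 1
  | otherwise = fromIntegral (Set.size shared) / fromIntegral (Set.size united)
  where
    a = Set.fromList xs
    b = Set.fromList ys
    shared = Set.intersection a b
    united = Set.union a b

-- 0 for identical word sets, 1 for sentences without common words.
jaqqard :: DistanceFunction
jaqqard xs ys = 1 - jaccardIndex xs ys
```

Because sets drop duplicates, repeated words within one sentence do not influence the result.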

Low Performance

Summarizing takes quite a long time, given enough sentences.

To Reproduce
Run:
opinosisSummary "Hi, i am leonhard. I like cats. cats like leonhard. Ice cream. Harakiri. This is the longest sentence by far. Cats will rule the world, because they are unstoppable. Samurais liked Ketchup and Leonhard. Cats cannot be Samurai, because they have furry paws. Japanese Cats are by far the furriest"

This took about a minute on a single processor, which is somewhat too long.

Expected behavior
Be much faster, for above example <10s

Proposed Solutions
There are multiple possible ways:

measure performance, and look where it is lost
more restrictive standard-selection (e.g. higher sigma theta)
some (faster?) libraries for cosine distance
maybe: make summaries in splits?
"bestPaths" and some others should be good candidates for multi-threading, but how are multithreaded libraries supported?

Desktop:

OS: Windows 8
GHCI 8.6.5
Version d9bf4a7

Additional Information:

I was thinking about the default values ... If the averaged path strength is smaller than 1, is the path not redundant at all? So only if a path actually overlaps with another one and shares at most one word does it get an averaged path strength higher than 1?

Look into "Generalised derived instances for newtypes"

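There is a Haskell language extension called "Generalised derived instances for newtypes" (GeneralizedNewtypeDeriving), which "gives every function the wrapped type has". More precisely, it lets a newtype inherit the type class instances of the type it wraps. It suits my Graph well, which is just a specific Data.Map.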

Proposed Solution
Look into the language extension, make sure everything still runs and expose everything the Data.Map has to offer.

Remove home-brew functions where possible.
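
A small sketch of what the extension does and does not cover, assuming a Graph newtype around a Data.Map with an Int payload (the real payload and the helper names nodeCount and insertNode are hypothetical). Class instances such as Semigroup and Monoid are taken over from the wrapped Map, while plain functions like Map.size still need a thin wrapper.

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import qualified Data.Map.Strict as Map

-- Assumed graph representation: word -> count.
-- Semigroup and Monoid are inherited from Map (left-biased union).
newtype Graph = Graph (Map.Map String Int)
  deriving (Eq, Show, Semigroup, Monoid)

-- Plain Data.Map functions are not derived and have to be wrapped.
nodeCount :: Graph -> Int
nodeCount (Graph m) = Map.size m

-- Insert a node, adding up the counts if the word is already present.
insertNode :: String -> Int -> Graph -> Graph
insertNode k v (Graph m) = Graph (Map.insertWith (+) k v m)
```

If exposing every Data.Map function is the main goal, a plain type synonym for the Map would do that directly, at the cost of not being able to define custom instances for Graph.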

Possible Problems:
Cabal may cause some problems, as the .cabal file will need to reference the language extension.

Additional Context:
Moving to Data.Map made the code much more readable and easier to use. I think exposing every function from Data.Map poses little danger and makes further development of the library (and using it) much easier.

Move from String to Text

I hear only bad stuff about Strings in Haskell (https://mmhaskell.com/blog/2017/5/15/untangling-haskells-strings), so I think it's better to use Text.

Better do this as soon as possible.

Also, the language extension {-# LANGUAGE OverloadedStrings #-} is used, but I don't think I need it for my current scope.
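
As a rough idea of what Text-based input handling could look like, here is a sketch that splits a document into lower-cased sentences of words. toSentences is a hypothetical helper and not part of the library, and the sentence delimiters are an assumption. split, toLower and words are regular Data.Text functions.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Split the input at '.', '!' and '?', lower-case every sentence and break
-- it into words. Empty sentences (e.g. after the final full stop) are dropped.
-- Other punctuation such as commas stays attached to the words.
toSentences :: Text -> [[Text]]
toSentences =
  filter (not . null)
    . map (T.words . T.toLower)
    . T.split (`elem` ".!?")
```

Once the library works on Text, OverloadedStrings becomes useful after all, because it allows string literals to be used where a Text is expected instead of calling T.pack.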
