google / haskell-indexer

Emits code crossreference data for Haskell sources.
Use an LRU cache or Bloom filters. Kythe dedupes anyway, but once we start emitting types there will be a lot of unnecessary content to process.
Both unidirectional and bidirectional.
How to map a small part of the GHC AST to the Kythe schema. A small tutorial.
Currently the type support in the Translate layer is pretty weak (strings), and non-existent in the Kythe frontend.
We should expose a stripped-down type to the Translate layer, approximately able to represent forall a (b :: k) . (Ctx a, Foo a b) => [...] -> [...] -> [...].
For example, Foo a b => a -> b would be encoded along the lines of:
abs Avar Bvar
tapp constr#1 fn#2 vname(b) vname(Foo) vname(a)
Note that the Kythe schema convention expects the return type to be the first parameter.
Also, Avar and Bvar are new absvars bound by the (implicit) foralls. So in general two functions foo :: a -> a and bar :: a -> a won't have the same Kythe type vname, since the absvars will differ. We gave some thought to this, and realized that full, proper type-level querying likely needs a separate index, so we won't stress about fitting all the abstract details into the Kythe schema.
We choose to fake constraints as additional parameters until Kythe has better support for them. Note that this can result in things having different types depending on the constraint tuple order. But since polymorphic things will generally have separate Kythe type vnames anyway, this is not a big loss.
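As a rough illustration of what such a stripped-down type could look like, here is a minimal sketch. The type and constructor names (StrippedType, TyVar, etc.) are illustrative, not the actual haskell-indexer API; it only shows the constraint-as-parameter and return-type-first conventions described above.

```haskell
-- Hypothetical stripped-down type for the Translate layer.
data StrippedType
  = TyVar String                       -- variable bound by an (implicit) forall
  | TyCon String                       -- named type constructor, e.g. Foo, (->)
  | TyApp StrippedType [StrippedType]  -- application, e.g. Foo a b
  deriving (Eq, Show)

-- Render in roughly the abs/tapp spirit: constraints are faked as extra
-- parameters, and the return type is listed first per the Kythe convention.
render :: StrippedType -> String
render (TyVar v)    = v
render (TyCon c)    = c
render (TyApp f as) = "(" ++ unwords (map render (f : as)) ++ ")"

-- Foo a b => a -> b, with the constraint as an extra parameter and the
-- return type first (modulo spelling of the function constructor).
example :: String
example = render (TyApp (TyCon "->")
                        [TyVar "b", TyApp (TyCon "Foo") [TyVar "a", TyVar "b"], TyVar "a"])

main :: IO ()
main = putStrLn example
```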
Hi,
We're currently reviewing haskell-indexer for use in auditing work. I wrote a tool called sift which is capable of generating a simple cross-package call graph of a Haskell package and writing it to a .json file. Example:
$ sift trace sift-bindings/*/* --flag-binding "ghc-prim GHC.Prim raise#" --call-trace | head -n 30
Flagged binding: ghc-prim:GHC.Prim.raise#
Used by aeson:Data.Aeson.Encoding.Builder.day
Call trace:
aeson:Data.Aeson.Encoding.Builder.day
|
+- base:GHC.Real.quotRem
| |
| +- base:GHC.Real.divZeroError
| | |
| | `- ghc-prim:GHC.Prim.raise#
| |
| `- base:GHC.Real.overflowError
|
`- base:GHC.Err.error
Used by aeson:Data.Aeson.Encoding.Builder.digit
Call trace:
aeson:Data.Aeson.Encoding.Builder.digit
|
`- base:GHC.Char.chr
|
`- base:GHC.Err.errorWithoutStackTrace
|
`- base:GHC.Err.error
|
`- ghc-prim:GHC.Prim.raise#
Preferably, we'd like to use haskell-indexer to achieve the same thing instead of maintaining two codebases.
I think I can get the same info from TickReference and Tick; XRef gives me a list of TickReferences and also a list of Relations.
I think from there I can produce a graph with Data.Graph by producing a list [(node, nodeid, [nodeid])] where the latter list is "my dependencies", like I do here:
-- | Graph all package bindings.
graphBindings ::
Set Binding
-> OrdGraph BindingId Binding
graphBindings bs =
ordGraph (map
(\binding -> (binding, bindingId binding, bindingRefs binding))
(Set.toList bs))
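For what it's worth, the same shape can be reproduced with plain Data.Graph from containers. A minimal, self-contained sketch, with a toy call graph standing in for the real bindings (the binding names below are made up for illustration):

```haskell
import Data.Graph (graphFromEdges, path, reachable)

-- Toy call graph in the (node, key, [key]) shape Data.Graph expects;
-- the third component is "my dependencies".
edges :: [(String, String, [String])]
edges =
  [ ("day node",      "aeson:day",         ["base:quotRem", "base:error"])
  , ("quotRem node",  "base:quotRem",      ["base:divZeroError"])
  , ("divZero node",  "base:divZeroError", ["ghc-prim:raise#"])
  , ("error node",    "base:error",        ["ghc-prim:raise#"])
  , ("raise node",    "ghc-prim:raise#",   [])
  ]

main :: IO ()
main = do
  let (g, _nodeFromVertex, vertexFromKey) = graphFromEdges edges
      Just day   = vertexFromKey "aeson:day"
      Just raise = vertexFromKey "ghc-prim:raise#"
  -- 'path' answers "is the flagged binding reachable from here?"
  print (path g day raise)
  -- 'reachable' gives everything transitively called (including the start).
  print (length (reachable g day))
```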
then I can produce a simple call graph like this:
callTrace :: OrdGraph BindingId node -> Graph.Vertex -> Graph.Vertex -> [Tree [Char]]
callTrace g start flagged =
fmap
(fmap
(\v' ->
let (_, bid', _) = ordGraphVertexToNode g v'
in S8.unpack (prettyBindingId bid')))
(filterForest
(flip (Graph.path (ordGraphGraph g)) flagged)
(Graph.dfs (ordGraphGraph g) [start]))
So I'm 90% confident I can fairly readily get the information I need to obsolete the sift tool. I have some questions:
Can it tell me that throw# is used by read "x" :: Int because the method instance for Int uses error? We need this for auditing, aside from it being a super cool feature in general.
What about base? What's your approach on that? We need this for auditing. Here's how sift tackles the tooling issues: it works from the lib dir, on the base package specifically. You just inject --frontend in the right place after building sift-frontend-plugin in the same package set. That lets you generate a profile of base. If you had trouble with this on haskell-indexer, maybe I can help out getting that to work.
For ordinary packages you run stack ghci --with-ghc sift-compiler and then you're done. I believe cabal repl --with-ghc sift-compiler would also work, but I haven't tested it. Why not just stack ghci --with-ghc ghc --ghci-options '--frontend Sift.FrontendPlugin'? Because GHC rejects frontend plugins when used with --interactive. 🤷‍♂️
Should run the Kythe verifier tests and the intermediate tests.
I was trying to index bytestring and found out that it couldn't be done with haskell-indexer; it also looks like it breaks the package db by unregistering bytestring, so I needed to run stack setup --reinstall to fix the situation. It also seems related to the question from @chrisdone in #79 about base, but it looks like base alone is not enough: the dependencies of GHC that come with it need special treatment. I see scripts from you @robinp in #56, but it looks like they don't cover libraries which are in git submodules of the GHC repo - why didn't they get indexed? BTW, shouldn't those scripts also be included in this repo with some file describing how to index GHC itself? (At the moment it's not clear to me where that /opt/ghc/bin/ghc comes from - probably it's from your custom install of GHC into /opt/ghc?)
For currently indexed module (easy): somehow report a list of missing decls, that were targeted by refs.
Globally: less trivial. Maybe given a set of packages, only report missing decls among these packages?
Can Kythe do any of these for us in some way (maybe in verifier mode)? @creachadair
See also #15, but with a less clear path forward, as Haskell doesn't need an explicit SWIGging phase and can just live with foreign imports. See https://groups.google.com/forum/#!topic/kythe-dev/Fb7HmZffRtw for a related discussion.
The core of the matter is that when emitting references to the C code, we need to use the VName emitted by the C indexer, but that is unpredictable (we need side-channel info from the C indexer somehow).
The matter is further complicated by hsc2hs, inline-c, and maybe other glue libs. Proper research is needed. Maybe we don't need to support all the edge cases, or these cases are actually separate problems to solve on their own.
Create a virtual source for them.
The GHC backend now supports GHC 7.10 AST.
The code should be adapted so it keeps supporting that, as well as the GHC 8 AST.
Note: while there, backwards support for the GHC 7.8 AST would be very easy, since there was only a slight change.
As final result, the named entries in the import list should be properly hyperlinked.
The module names, like Foo in import Foo (bar, baz), should ref/imports the module node (see #8). Also, bar and baz should ref/imports too.
import Foo (Bar(..)) - should we emit multiple ref/imports edges from the ..? Or just do nothing there.
What about aliased imports (regardless of qualification)? For example:
import Data.Text as T
...
foo = T.append "x" "y"
Easiest is we do nothing. In any case, T.append will still refer to the same node. If we emitted a defines/binding on the T of the import line, then the T of T.append might ref that. But there's no really suitable node in the Kythe schema (unless we shoehorn talias + aliases), and the benefit is dubious. Also, the def/binding would need to come from an implicit anchor, since multiple modules can be aliased to the same name (so the T in the import line would also be a reference).
...contrary to ghc-pkg unregister. Spotted by @MaskRay:
Sometimes stack will copy precompiled packages instead of rebuilding (configure/build/copy/register), therefore ghc_kythe_wrapper is not invoked.
% ./build-stack.sh /tmp/logs mtl
unregistering would break the following packages: proto-lens-combinators-0.1.0.7 proto-lens-protoc-0.2.1.0 proto-lens-descriptors-0.2.1.0 proto-lens-0.2.1.0 conduit-1.2.11 resourcet-1.1.9 parsec-3.1.11 lens-family-1.2.1 mmorph-1.0.9 kan-extensions-5.0.2 adjunctions-4.3 free-4.12.4 exceptions-0.8.3 haskell-indexer-pipeline-ghckythe-0.1.0.0 haskell-indexer-frontend-kythe-0.1.0.0 kythe-schema-0.1.0.0 kythe-proto-0.1.0.0 haskell-indexer-backend-ghc-0.1.0.0 (ignoring)
mtl-2.2.1: using precompiled package
# After deleting ~/.stack/precompiled/x86_64-linux{,-tinfo6}/ghc-8.0.2/1.24.2.0/mtl-2.2.1/*
% ./build-stack.sh /tmp/logs mtl
unregistering would break the following packages: proto-lens-combinators-0.1.0.7 proto-lens-protoc-0.2.1.0 proto-lens-descriptors-0.2.1.0 proto-lens-0.2.1.0 conduit-1.2.11 resourcet-1.1.9 parsec-3.1.11 lens-family-1.2.1 mmorph-1.0.9 kan-extensions-5.0.2 adjunctions-4.3 free-4.12.4 exceptions-0.8.3 haskell-indexer-pipeline-ghckythe-0.1.0.0 haskell-indexer-frontend-kythe-0.1.0.0 kythe-schema-0.1.0.0 kythe-proto-0.1.0.0 haskell-indexer-backend-ghc-0.1.0.0 (ignoring)
mtl-2.2.1: configure
mtl-2.2.1: build
mtl-2.2.1: copy/register
# Though ~/.stack/precompiled/x86_64-linux/ghc-8.0.2/1.24.2.0/mtl-2.2.1/ has been regenerated, further `./build-stack.sh /tmp/logs mtl` does not reuse precompiled one, weird.
https://github.com/commercialhaskell/stack/blob/master/src/Stack/Build/Execute.hs#L1141
Entities in the export list should ref/exports the actual entities.
At simplest, reexports could also ref the actual entities.
Later, if needed, we could add a special imported-at edge that would point to the import anchor the reexport is coming from. But possibly this is premature thinking - what would be a UI use case that needs this feature?
For reference, see #33 where import indexing was done.
Currently $(foo) is indexed with the unspliced content, but the reference to foo itself doesn't get emitted. This causes, for example, backreferences on foo not to show the usage sites.
Related:
bar = [|$(foo)|]
[bar|1+1]
Can we access the pre-splicing info from some of the ASTs?
It's really just a thin binary wrapper, no need to have it in a separate package.
Currently we hardcode /opt/kythe; instead we could expect an environment variable. Kythe uses KYTHE_ROOT_DIRECTORY and KYTHE_OUTPUT_DIRECTORY in its scripts.
The installation instructions are Linux-specific. Say in the docs whether it works on Windows, hasn't been tested, etc.
They don't seem to get crossreferences.
Currently if the API fails, it just throws an ExitFailure 1 exception. Instead, print a detailed error message.
Hi @robinp,
you recently made a crossreferenced GHC 8.0.1.
Could you make one for 8.2.1 as well (and maybe host them both under qualified URLs, and have http://stuff.codereview.me/#ghc point to the latest one)?
That would be very useful.
Thanks!
(Also I linked your post on Reddit so that not-ghc-devs-subscribers know about it as well!)
GHC implements too much administrative magic in ghc/Main.hs to replicate it all in GhcApiSupport. We should restructure the GHC backend to be a frontend plugin instead, which will get (almost all) of the magic for free. See https://ghc.haskell.org/trac/ghc/ticket/14018#ticket.
Notably, the frontend plugin might still need to decide whether it's running in Make or Oneshot mode, and act accordingly. For Make mode this means invoking compileFile on non-Haskell sources and sticking the resulting objects into dflags (see GHC's doMake function). No idea for Oneshot.
All this fuss is needed for 1:1 behavior with real GHC invocations. 95% of the time this is not needed, since machine code generation is not needed for indexing most code.
An exception is TemplateHaskell that runs an imported function (in which case at least bytecode generation is needed), or when it runs an imported FFI-d function (in which case C compilation, machine code generation and linking are needed too). Other exceptions are some modules that use FFI in particular ways (maybe foreign exports?) which were not content with just generating bytecode. Also, having optimization turned on is not compatible with bytecode generation (but fine with no codegen or machine codegen). Etc.
+@mpickering FYI.
For example, tests where:
If a file contains the following definition of go, only the first occurrence of go is linked. The second occurrence is not linked.
go 0 = ()
go 1 = ()
Take (Cabal package) + (Haskell module) = Kythe package
Emit to index:
Note: import refs to the module are in a separate issue, see below.
Profile to see where we spend time.
One suspect is the uniplate traversals. We could try to use the version from Data.Data.Lens, which caches the possible paths and could speed traversals up.
It would be nice to mark all the stuff the module exports (even if those are defined in another module, and exported via Haskell's reexport functionality).
That would enable us to make a Kythe-assisted auto-import tool.
We could make a Kythe semantic entity (abuse kind=interface?) to represent "module exports", add anchors childof that entity, with the anchors referencing the exported semantic nodes. + @creachadair
See http://www.kythe.io/docs/schema/marked-source.html - these are needed for UIs or tools to format names of entities in context-dependent ways.
I ran:
$ ./build-stack.sh tmp-logs mtlparse cpu
cpu-0.1.2: configure
cpu-0.1.2: build
mtlparse-0.1.4.0: configure
mtlparse-0.1.4.0: build
cpu-0.1.2: copy/register
mtlparse-0.1.4.0: copy/register
Completed 2 action(s).
$ cat tmp-logs/.log
========= FAKE GHC =======
== pwd: /home/niklas/src/haskell/haskell-indexer
== Passing through..
/raid/stack/programs/x86_64-linux/ghc-8.0.2/bin/ghc --info
========= FAKE GHC =======
== pwd: /home/niklas/src/haskell/haskell-indexer
== Passing through..
/raid/stack/programs/x86_64-linux/ghc-8.0.2/bin/ghc --numeric-version
Export child relations - for datatype/ctors, class/methods, functions/args ...
This is crucial for generating code outline.
See comments in indexer backend, Kythe verifier comments [1], also https://groups.google.com/forum/#!topic/kythe-dev/Mvus07b8c-U.
[1]: in kythe-verification/testdata/basic/RecordReadRef.hs, also RecordWriteRef.hs
At least make this configurable; in some situations it's useful to have the ref/call edge present.
This is also an overarching problem: Haskell (or partial application and first-class functions in general) doesn't have a good notion for distinguishing a reference from a call.
Brought up by @mpickering. A few things to sort out for that question:
Does GHC provide convenient access to the post-processed ASTs, ideally with post-processed spans?
If not, are the post-processed sources accessible and can we do an extra compilation for them to get the spans?
How would we deduplicate definitions/references that are present in both the original and post-processed sources? (Side note: IIRC GhcAnalyser drops references that originated from generated code, but I'm not sure if TH falls under that condition.)
Where would we place post-processed code (this is a valid question for CPP too)? Kythe supports virtual roots, and generally we can emit whatever code fragments we want anywhere in the tree, but it would have to be thought up what reference/generates/... edges would be present.
Would we emit full postprocessed sources (more problematic duplication-wise), or do some smart thing to just put the TH-generated source fragments in virtual files?
+@creachadair: does for example the Kythe C++ indexer emit virtual fragments for un-CPP-d code? Do you have any takeaways from earlier attempts on this topic?
Hey @robinp,
I can't load http://stuff.codereview.me/lts/9.2/ currently; it seems to load extremely slowly and times out after 2 minutes (some resources do seem to load eventually).
Could you have a look if that should still be working?
Thanks!
The build for kythe-proto needs access to Kythe's storage.proto file. It achieves this by symlinking to ../../../../../third_party/kythe/kythe/proto/storage.proto. Would it be better to just vendor this specific file? The symlink makes building the package in isolation much more difficult, and I had to add some special packaging logic when I packaged it for nix.
The GHC backend needs to fetch the Haddock comments, and put them into the common Translate layer (maybe a bit transformed?), associating them to the correct entity. Then the Kythe frontend should format and emit these.
The Haddocks can be fetched using the Haddock API (has to be opened up, see haskell/haddock#595) based on the GHC AST we already have access to.
Note: a special arg needs to be passed to GHC to have the Doc nodes present in the AST. This is likely Opt_Haddock, can be passed in GhcApiSupport somehow. See this GHC test for inspiration.
Nix provides a very convenient wrapper, ghcWithHoogle, which generates hoogle indexes for all dependencies of your project. It would be similarly convenient if there were a ghcWithIndexer wrapper which instead built Kythe indexes using haskell-indexer.
I intend to implement this, but this ticket is for the contingency that I don't.
As hinted in http://www.kythe.io/docs/schema/writing-an-indexer.html#_cross_references, keeping the signatures in VNames short is crucial for good IO performance.
For non-top-level/exported entities, we currently append disambiguator info like the filename and line/col to the signature - approximately haskell:term:pkg_name:The.Module:thing:/pkg_name/the/file.hs-123-456.
We could cut down on this.
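One cheap way to cut down on it (a sketch only, not what haskell-indexer does today): keep the human-readable prefix but replace the long file/line/col disambiguator with a short, stable hash of it, e.g. FNV-1a. The shorten function and the 16-hex-char format below are hypothetical choices.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.Word (Word64)
import Numeric (showHex)

-- FNV-1a over a String: a short, stable stand-in for the long
-- "/pkg_name/the/file.hs-123-456" disambiguator suffix.
fnv1a :: String -> Word64
fnv1a = foldl step 0xcbf29ce484222325
  where step h c = (h `xor` fromIntegral (ord c)) * 0x100000001b3

-- Hypothetical shortened signature: readable prefix + at most 16 hex chars.
shorten :: String -> String -> String
shorten prefix disambig = prefix ++ ":" ++ showHex (fnv1a disambig) ""

main :: IO ()
main = putStrLn (shorten "haskell:term:pkg_name:The.Module:thing"
                         "/pkg_name/the/file.hs-123-456")
```

The trade-off is losing human readability of the suffix in exchange for bounded, IO-friendly VName lengths.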
The current extractor uses Stack, which is not supported by all Haskell projects. Implement a Cabal-based extractor to fill the gap.
Fails with GHC 7.10:
export STACK_YAML=$(readlink -f stack-6.30.yaml)
stack install
cd kythe_verifier
./test.sh
Verifying: testdata/basic/ImportsRef.hs
Could not verify all goals. The furthest we reached was:
testdata/basic/ImportsRef.hs:5:6-5:89 @"Data.Set" ref/imports vname("containers-0.5.7.1:Data.Set", "", "", "", "haskell")
In the broader context, we might want to make verifier tests dependent on the GHC version. Maybe fire up the tests with a light Haskell wrapper (instead of a shell script), where we can ifdef which tests we want to run. That way CI integration of those tests would also be easier (we could move them into the ghc-kythe package).
cc @ivan444
Spotted by @MaskRay:
Sometimes $PWD is not something like ...../mtl-2.2.1 ($package-$version)
Currently the Stack GHC wrapper script prepends the package name to the generated filepaths (see wrappers/stack/ghc), and also assumes that the files passed to the compiler command are relative to the current dir (which is usually the case, hence the removal of the ./ prefix in the script).
But sometimes the paths may be absolute (like pointing to /home/user/.stack/setup-exe-src/....hs or /tmp/setup-xyz/Setup.lhs).
Now we seem to stash these files as relative paths under the package dir. Which is not too bad; at least they are bundled with the related package. But if these files were shared/reused between compiled packages, the duplication would be awkward (does that ever happen?)
Just mentioning that the Kythe frontend at least provides the notion of a 'root', which we don't use now. Using roots one could segregate (or namespace) paths, which is useful if independent paths have a chance to collide inside a single corpus, or if generated files should be explicitly separated.
I'm not sure it's our task to solve this "once and for all" (forall?), but could provide some path rewrite rules facility, and let various build systems set up the rewrite rules as appropriate for them.
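Such a rewrite-rules facility could be as simple as an ordered list of prefix substitutions, applied first-match-wins, that each build system configures for itself. A hypothetical sketch (the rule format and names are made up, not an existing haskell-indexer option):

```haskell
import Data.List (isPrefixOf)

-- An ordered list of (prefix, replacement) rules; first match wins.
type RewriteRule = (FilePath, FilePath)

-- Apply the first rule whose prefix matches; leave the path alone otherwise.
rewrite :: [RewriteRule] -> FilePath -> FilePath
rewrite [] p = p
rewrite ((from, to) : rs) p
  | from `isPrefixOf` p = to ++ drop (length from) p
  | otherwise           = rewrite rs p

main :: IO ()
main = do
  let rules = [ ("/home/user/.stack/setup-exe-src/", "generated/setup/")
              , ("/tmp/", "generated/tmp/") ]
  putStrLn (rewrite rules "/tmp/setup-xyz/Setup.lhs")
```

Combined with Kythe roots, the replacement side of a rule could also select a root, keeping generated files namespaced away from real sources.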
The necessary changes are on the branch on my repo - https://github.com/mpickering/haskell-indexer/tree/ghc-8.4.3
It's possible that mpickering@ae09b7f can be cherry-picked without the other changes.
Currently, haskell-indexer needs to be integrated with each of various build systems. It needs to be concerned with loading the code with the GHC API with appropriate flags so it can be indexed.
This issue is about indexing from a source plugin instead.
This would remove from haskell-indexer the burden of dealing with the GHC API to parse the code and to load code dependencies. The user would only need to arrange for -fplugin Language.Haskell.Indexer to be passed to all invocations of GHC during a regular build of whatever build system she has in place. This should remove most of the build configuration problems that could make haskell-indexer fail, and would probably remove the need to use and maintain GHC wrappers.
Note that there was a former issue (#55) for implementing a frontend plugin, which is something different.
I'm not very acquainted with Kythe or haskell-indexer yet, but I'm curious whether the source plugin idea is as good as it sounds from the outside.
We already have a C preprocessor (pgmP) override option, but not a C compiler (pgmc) one yet. The override is useful if the actual path of the command differs from the values in the GHC settings file.
This might be useful for marking function params.
Needed for writing interesting static linters (for example mask + forkIO)
When the instance is fully resolved.
Basic plan (probably doesn't handle edge cases etc):
The HsVar's id's idDetails is ClassOpId (also isClassOpId_maybe).
Get the classTyCon from the Class of the ClassOpId.
The HsWrap wrapping the HsVar contains WpEvApp (EvId x), where x's type is of the form TyConApp <the classTyCon> ....
Here x is the dFunId... but how do we go from that (knowing the Class and the method name) to the instance method var?
Idea: get the module of the dFunId's name using nameModule, then GHC.getModuleInfo on it. Then we can look up the ClsInst list in that module... but it doesn't refer to the actual members, just the DFunId, again. Ok, not the best idea. Can we take that DFunId apart?
Idea #2: assign a Tick to instance methods by concatenating the DFunId tick with the method name, so it's trivial to construct the Tick reference from the locally available info. Sounds like a much better idea.
This can work - but the instance method decls need to be taken from the typechecked tree, where we can harvest the related dFunId (from the abe_mono it is easier to find the classTyCon, since there's no context) by looking at which $c... gets applied in the $d... binding.
Currently the ghckythe-wrapper can be used in place of the ghc command to emit artifacts, which is fine for local, non-isolated indexing.
The more proper way would be to add extractors for the build systems (here Stack, Cabal build, Cabal new-build, ...) that save all the required inputs in index packs, so the separate indexing phase can happen exclusively based on the hermetic index pack data.
This would make it possible to do reproducible and/or distributed indexing.
The main tasks are:
This is not always trivial, as the deps need to capture auto-generated inputs etc too.
There's also the question of system-level dependencies (like global shared libraries) - should these be assumed omni-present on both the extractor and the indexer machines? Should they be added to the index pack?
How should one use the index pack's content?
The naive solution is to unpack the deps needed by the CompilationUnit to some local place, and work from there. Attention has to be paid that the build is isolated, and GHC doesn't pick up unexpected dependencies.
The more desired solution is to make the indexer (and so GHC) pull the dependencies on-demand from the index pack (instead of prefetching and extracting). This has the benefit that in case of over-eager extractors (that include more resources in the pack than strictly needed) it's still only the needed data that's pulled.
It would be useful to release the packages on Hackage. For my purposes, it would mean that I can stop locally packaging everything for nix and can start to push nix support upstream.
For example,
add :: Int -> Int -> Int
add x y = undefined
In the first line, add wouldn't have a reference.