jtablesaw / tablesaw
Java dataframe and visualization library
Home Page: https://jtablesaw.github.io/tablesaw/
License: Apache License 2.0
In loading a very large file from disk using .saw, it seemed like the encoding process took most of the time. The file should be written encoded as it is in memory, and read back the same way, avoiding as much work in building the dictionary as possible.
This will have the positive side effect of significantly reducing disk space requirements for category columns
avoid duplicating the data during group operations
for auto translation between miles and meters, pounds and kilograms, etc.
In addition to loading from CSV, Tablesaw should allow users to populate a table from the result of a JDBC query against a relational database like PostgreSQL.
Right now this is a fixed set of values. It should be more flexible
should return a primitive hashSet as appropriate to the type, or a hashSet for categories
when compared in a case-insensitive way.
This will require changes to Table and also to CSV IO. Result sets, at least those originating in an RDBMS, must have unique names.
The user should be able to "difference" a temporal column. The function would be an instance method on the column. The template below is for an IntColumn version. The temporal version would take a TimeUnit argument.
/**
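A minimal sketch of what differencing could look like, using plain arrays rather than Tablesaw columns. The class name, the `MISSING` markers, and the use of `java.time.temporal.ChronoUnit` in place of the `TimeUnit` argument mentioned above are all assumptions made for illustration.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

final class ColumnDifference {
    // Hypothetical "no value" markers; not Tablesaw's own constants.
    static final int MISSING = Integer.MIN_VALUE;
    static final long MISSING_LONG = Long.MIN_VALUE;

    /** Element i of the result is values[i] - values[i - 1]; element 0 is MISSING. */
    static int[] difference(int[] values) {
        int[] result = new int[values.length];
        if (values.length == 0) {
            return result;
        }
        result[0] = MISSING; // the first row has no predecessor
        for (int i = 1; i < values.length; i++) {
            result[i] = values[i] - values[i - 1];
        }
        return result;
    }

    /** Temporal variant: the gap between consecutive values, measured in the given unit. */
    static long[] difference(LocalDateTime[] values, ChronoUnit unit) {
        long[] result = new long[values.length];
        if (values.length == 0) {
            return result;
        }
        result[0] = MISSING_LONG;
        for (int i = 1; i < values.length; i++) {
            result[i] = unit.between(values[i - 1], values[i]);
        }
        return result;
    }
}
```

The result has the same length as the input, so it could be added to the same table as the source column without realigning rows.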
See the attached pdf for an example of the output.
LocalDateColumn implements IntIterable so you can iterate on the packed date values. For user friendliness, it should instead implement Iterable and provide a getIntIterator method to get the packed version.
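One way the proposed shape could look is sketched below. `PackedDateColumn` is a hypothetical stand-in for LocalDateColumn, and the `yyyyMMdd` decimal packing is an assumption chosen for readability (it only handles positive years); it is not claimed to be Tablesaw's actual packing scheme.

```java
import java.time.LocalDate;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

final class PackedDateColumn implements Iterable<LocalDate> {
    private final int[] packed;

    PackedDateColumn(LocalDate... dates) {
        packed = new int[dates.length];
        for (int i = 0; i < dates.length; i++) {
            packed[i] = pack(dates[i]);
        }
    }

    // Hypothetical packing: 2016-03-09 becomes 20160309.
    static int pack(LocalDate date) {
        return date.getYear() * 10_000 + date.getMonthValue() * 100 + date.getDayOfMonth();
    }

    static LocalDate unpack(int value) {
        return LocalDate.of(value / 10_000, (value / 100) % 100, value % 100);
    }

    /** User-friendly iteration: yields LocalDate objects. */
    @Override
    public Iterator<LocalDate> iterator() {
        return new Iterator<LocalDate>() {
            private int position = 0;

            @Override
            public boolean hasNext() {
                return position < packed.length;
            }

            @Override
            public LocalDate next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return unpack(packed[position++]);
            }
        };
    }

    /** Fast path: yields the packed int values without allocating LocalDate objects. */
    PrimitiveIterator.OfInt getIntIterator() {
        return IntStream.of(packed).iterator();
    }
}
```

With this arrangement a plain `for (LocalDate d : column)` loop works, while performance-sensitive code can still call `getIntIterator()`.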
The current index implementation takes too long to perform an index operation on a large int column. Indexing times vary but are in the 7 to 8 minute range on a 500 million record column.
The Saw data store uses Snappy for compression by default. It could possibly use type specific compression that was both smaller and faster (Roaring Bitmaps for Booleans, FastPFOR for Ints). Bitmaps could double as the in-memory representation, eliminating translation overhead.
to make them easier to understand and use from a REPL
It's possible that both boolean and integer columns could use specialized compression that allowed operations on the data in compressed format. Bools could use Roaring Bitmaps for example. Integers could use integer compression (library from groupon?). Ints are especially important as they're used for dates, times, and categorical data, as well as for IntColumns.
to minimize memory requirements of indexes
Currently, groups are formed by unique combinations of column values. It should be possible to group on the result of functions called on column values - for example, directly grouping on a month value extracted from a date column, without having to create a month column and use that.
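The idea can be illustrated without any Tablesaw types. In this sketch, `FunctionGrouping.groupRows` is a hypothetical helper that groups row indexes by whatever key a function computes for each row, so no intermediate month column has to be materialized.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

final class FunctionGrouping {
    /**
     * Groups row indexes 0..rowCount-1 by the key computed for each row.
     * Groups appear in order of first occurrence; rows within a group keep their order.
     */
    static <K> Map<K, List<Integer>> groupRows(int rowCount, IntFunction<K> keyOf) {
        Map<K, List<Integer>> groups = new LinkedHashMap<>();
        for (int row = 0; row < rowCount; row++) {
            groups.computeIfAbsent(keyOf.apply(row), k -> new ArrayList<>()).add(row);
        }
        return groups;
    }
}
```

Given a `LocalDate[] dates`, grouping on the month would then be `FunctionGrouping.groupRows(dates.length, r -> dates[r].getMonth())`.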
Write an importer that simply streams data from CSV direct to .saw output, so that interactive use of the data can proceed quickly
public int row(int r) {
return r;
}
CSV imports are orders of magnitude slower than Saw file loads. Try to speed things up by opening multiple importers and loading separate columns with each. (Activity Monitor suggests that CSV imports are CPU-bound, and that only one core is currently used.)
sequence data (including dates)
random data (by distribution)
code is in LocalDate, TypeUtils.
The folder written to is the expected folder concatenated with the expected folder, thus:
expected:
/foo/bar/bam/table.saw
is written to
/foo/bar/bam/foo/bar/bam/table.saw
like so:
/*
http://www.apache.org/licenses/LICENSE-2.0
For an integer column, the following operations can all be implemented on top of indexes, rather than on the column itself, and so should be more efficient:
sum()
mean()
median() (and all other percentile operations)
max()/min()
equalTo(), greaterThan(), atLeast(), atMost(), lessThan()
between() and all variations on between();
standardDeviation
variance()
histogram binning?
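To make the claim concrete, here is a sketch of a few of these operations answered from a sorted index alone. `IntIndex` is a hypothetical class built on `java.util.TreeMap` (value to row numbers); it is an assumption for illustration, not the index implementation Tablesaw uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

final class IntIndex {
    // Sorted map from each distinct value to the rows that hold it.
    private final TreeMap<Integer, List<Integer>> rowsByValue = new TreeMap<>();
    private final int size;

    IntIndex(int[] column) {
        for (int row = 0; row < column.length; row++) {
            rowsByValue.computeIfAbsent(column[row], k -> new ArrayList<>()).add(row);
        }
        size = column.length;
    }

    int min() { return rowsByValue.firstKey(); } // throws NoSuchElementException if empty
    int max() { return rowsByValue.lastKey(); }

    /** One multiplication per distinct value instead of one addition per row. */
    long sum() {
        long total = 0;
        for (Map.Entry<Integer, List<Integer>> e : rowsByValue.entrySet()) {
            total += (long) e.getKey() * e.getValue().size();
        }
        return total;
    }

    double mean() { return (double) sum() / size; }

    /** Lower median: walk the distinct values in order until the middle row is covered. */
    int median() {
        int target = (size - 1) / 2;
        int seen = 0;
        for (Map.Entry<Integer, List<Integer>> e : rowsByValue.entrySet()) {
            seen += e.getValue().size();
            if (seen > target) {
                return e.getKey();
            }
        }
        throw new NoSuchElementException("empty column");
    }

    /** Rows whose value lies in [low, high], returned in ascending row order. */
    List<Integer> between(int low, int high) {
        List<Integer> rows = new ArrayList<>();
        for (List<Integer> matches : rowsByValue.subMap(low, true, high, true).values()) {
            rows.addAll(matches);
        }
        Collections.sort(rows);
        return rows;
    }
}
```

The savings grow with the number of repeated values: `sum()` and `median()` touch each distinct value once, and `between()` touches only the matching key range.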
either from Spark or Smile?
using index results, you can just divide the keys, and reassemble the results in order.
may require a different approach for standard queries if we are to maintain result order à la kdb.
for example, instead of just
Table newTable = table.select(String... columnNames).where(aFilter);
it should be possible to say something like:
table.select(dateColumn.month(), dateColumn.year(), sales).where(aFilter);
I can't see one. If there isn't, I think we should add a way of copying both columns and tables. Otherwise you add a column from one Table to another, modify that column, and screw up the original Table.
Clearly with the big-data type workflows, you want to avoid copying but Table will also be useful for many cases where people are dealing with much smaller things, for instance a Table of summary statistics that you want to save as csv. Calling copy in these cases won't be an issue.
The other approach is to make Column immutable and so you don't have to defensively copy. It may be too late for that here.
here is an example:
`public FloatColumn round() {
    FloatColumn newColumn = create(this.name() + "[rounded]");
    for (int r = 0; r < this.size(); ++r) {
        float value = this.data.getFloat(r);
        newColumn.set(r, (float) Math.round(value));
    }
    return newColumn;
}`
the set fails because the length of newColumn is 0.
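The same failure mode can be reproduced with a plain `java.util.ArrayList`, used here only as a stand-in for the column's backing store (an assumption; the internals of FloatColumn are not shown in the report). Setting by index on an empty container throws, while appending does not.

```java
import java.util.ArrayList;
import java.util.List;

final class RoundSketch {
    /** Mirrors the reported pattern: set() on a freshly created, empty list. */
    static List<Float> roundWithSet(List<Float> data) {
        List<Float> rounded = new ArrayList<>();
        for (int r = 0; r < data.size(); r++) {
            // Throws IndexOutOfBoundsException on the first iteration: rounded.size() is 0.
            rounded.set(r, (float) Math.round(data.get(r)));
        }
        return rounded;
    }

    /** One possible fix: append each rounded value instead of setting by index. */
    static List<Float> roundWithAdd(List<Float> data) {
        List<Float> rounded = new ArrayList<>(data.size());
        for (float value : data) {
            rounded.add((float) Math.round(value));
        }
        return rounded;
    }
}
```

An alternative with the same effect would be to create the new column pre-sized to the source length so that index-based `set` is valid.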
so reading CSV doesn't require columns to be pre-defined, which is tedious for wide files
It would mitigate sharing issues and make several table ops simpler, including the use of TemporaryView.
Sorting currently defaults to ascending order:
t.sortOn("column1", "column2");
sorts in ascending order based on the values in the two given columns.
To specify a descending sort, you have to use:
t.sortDescendingOn("column1");
If you want to mix ascending and descending orders, for example sorting by age starting with the youngest and then by height starting with the tallest, you have to construct a Sort object and pass it to a specialized sortOn method. It should be possible to simply write:
sortOn("age", "-height"); // by age starting with the youngest, then by height starting with the tallest.
This can be implemented by parsing the column names and constructing a Sort object behind the scenes
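A rough sketch of that parsing step follows. It builds a `java.util.Comparator` rather than Tablesaw's Sort object, and it models rows as `Map<String, Integer>`; both choices, along with the `SortSpec` name, are assumptions made to keep the example self-contained.

```java
import java.util.Comparator;
import java.util.Map;

final class SortSpec {
    /**
     * Turns specs such as "age" or "-height" into a chained comparator.
     * A leading '-' means descending; anything else means ascending.
     */
    static Comparator<Map<String, Integer>> parse(String... columnSpecs) {
        Comparator<Map<String, Integer>> result = null;
        for (String spec : columnSpecs) {
            boolean descending = spec.startsWith("-");
            String column = descending ? spec.substring(1) : spec;
            Comparator<Map<String, Integer>> next = Comparator.comparing(row -> row.get(column));
            if (descending) {
                next = next.reversed();
            }
            // Earlier columns take priority; later ones only break ties.
            result = (result == null) ? next : result.thenComparing(next);
        }
        if (result == null) {
            throw new IllegalArgumentException("at least one column is required");
        }
        return result;
    }
}
```

One consequence of this convention is that a column whose name itself begins with `-` cannot be sorted ascending through the shorthand, so the explicit Sort object would still be needed for that case.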
implement auto indexing to speed search. indexing could occur on load or on first query that uses a particular column.
Wrap it with an interface that can have alternate implementations, e.g. ones backed by IntArrayList or an int[].
currently only counts and their derivative proportions are supported
Java's boolean type doesn't offer much support for dealing with missing values. Convert the column to use byte instead of boolean and use Byte.MIN_VALUE as the missing value.
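A small sketch of the byte-backed representation is shown below. Only the use of `Byte.MIN_VALUE` for missing values comes from the proposal; the class name, the 0/1 encoding for false/true, and the method names are assumptions.

```java
final class ByteBooleanColumn {
    static final byte MISSING = Byte.MIN_VALUE; // as proposed above
    static final byte FALSE = 0;                // assumed encoding
    static final byte TRUE = 1;                 // assumed encoding

    private final byte[] data;

    /** A null argument is stored as a missing value. */
    ByteBooleanColumn(Boolean... values) {
        data = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            data[i] = (values[i] == null) ? MISSING : (values[i] ? TRUE : FALSE);
        }
    }

    boolean isMissing(int row) {
        return data[row] == MISSING;
    }

    /** Returns the value at the row, refusing to invent one for a missing entry. */
    boolean get(int row) {
        if (isMissing(row)) {
            throw new IllegalStateException("row " + row + " is missing");
        }
        return data[row] == TRUE;
    }

    int countTrue() {
        int count = 0;
        for (byte b : data) {
            if (b == TRUE) count++;
        }
        return count;
    }

    int countMissing() {
        int count = 0;
        for (byte b : data) {
            if (b == MISSING) count++;
        }
        return count;
    }
}
```

Because a missing entry is neither TRUE nor FALSE, counts and proportions can exclude it explicitly instead of silently treating it as false.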
Make sure that the toDoubleArray() method handles missing values correctly.