jtablesaw / tablesaw

Java dataframe and visualization library

Home Page: https://jtablesaw.github.io/tablesaw/

License: Apache License 2.0

Java 99.43% HTML 0.57% Shell 0.01%

dataframe data-frame java java-dataframe visualization plotting statistics chart machine-learning statistical-analysis

tablesaw's Issues

Store Category columns in .saw format with dictionary encoding intact

In loading a very large file from disk using .saw, it seemed like the encoding process took most of the time. The file should be written encoded as it is in memory, and read back the same way, avoiding as much work in building the dictionary as possible.

This will have the positive side effect of significantly reducing disk space requirements for category columns
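
As a rough illustration of the idea, the sketch below dictionary-encodes a string column into a list of distinct values plus one int code per row. The class and field names are hypothetical and are not Tablesaw's actual implementation; the point is only that if the dictionary and the codes were persisted as-is, a reader could restore them without rebuilding the map from raw strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a category column; not Tablesaw's real class.
final class DictionaryEncodingSketch {
    // Distinct values, in order of first appearance. Index = code.
    final List<String> dictionary = new ArrayList<>();
    // One code per row, pointing into the dictionary.
    final int[] codes;

    DictionaryEncodingSketch(String[] values) {
        Map<String, Integer> codeByValue = new HashMap<>();
        codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer code = codeByValue.get(values[i]);
            if (code == null) {
                // First time this value is seen: assign the next free code.
                code = dictionary.size();
                dictionary.add(values[i]);
                codeByValue.put(values[i], code);
            }
            codes[i] = code;
        }
    }

    // Decodes a single row back to its string value.
    String get(int row) {
        return dictionary.get(codes[row]);
    }
}
```

Writing `dictionary` once and `codes` as a plain int array is also where the disk-space saving would come from, since each repeated string is stored only once.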

Publish to Maven Central

Test table#append() thoroughly

Replace TableGroups with a version based on Views

avoid duplicating the data during group operations

Feature: Integration with Units of Measure Library

for auto translation between miles and meters, pounds and kilograms, etc.

Add integration to load data from relational databases

In addition to loading from csv. Tablesaw should allow users to populate a table from the result of a JDBC query against a relational database like PostgreSQL.

CsvReader should load data from URLs via HTTP as well as from local files

Add the ability to specify Missing-Value strings on CSV import

Right now this is a fixed set of values. It should be more flexible
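
One possible shape for this, sketched without reference to the real CsvReader: the caller passes the set of strings to treat as missing, and the parser substitutes a sentinel. The method name `parseInts` and the sentinel value are assumptions made for illustration only.

```java
import java.util.Set;

final class MissingValueSketch {
    // Hypothetical sentinel; the real missing-value indicator may differ.
    static final int MISSING_VALUE = Integer.MIN_VALUE;

    // Parses one column of CSV cells, mapping any caller-supplied
    // missing-value string to the sentinel instead of failing.
    static int[] parseInts(String[] cells, Set<String> missingIndicators) {
        int[] result = new int[cells.length];
        for (int i = 0; i < cells.length; i++) {
            String cell = cells[i].trim();
            result[i] = missingIndicators.contains(cell)
                    ? MISSING_VALUE
                    : Integer.parseInt(cell);
        }
        return result;
    }
}
```

A caller could then pass, say, a set containing "NA" and the empty string for one file and a different set for another.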

Add asSet() method to all column types

Should return a primitive hash set appropriate to the column type, or a HashSet for categories.

Ensure that the names of columns are unique

when compared in a case-insensitive way.

This will require changes to Table and also to CSV I/O (result sets, at least those originating in an RDBMS, must have unique names).
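
A minimal sketch of the check itself, assuming column names are available as a list of strings. The helper name `firstDuplicate` is hypothetical; where it would be called from in Table or the CSV reader is left open.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class UniqueColumnNamesSketch {
    // Returns the first name that collides with an earlier one when
    // compared case-insensitively, or null if all names are unique.
    static String firstDuplicate(List<String> names) {
        Set<String> seen = new HashSet<>();
        for (String name : names) {
            // Locale.ROOT avoids locale-specific case rules (e.g. Turkish i).
            if (!seen.add(name.toLowerCase(Locale.ROOT))) {
                return name;
            }
        }
        return null;
    }
}
```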

Add difference function for numeric and date-time cols

The user should be able to "difference" a temporal column. The function would be an instance method on the column. The template below is for an IntColumn version. The temporal version would take a TimeUnit argument.

/**
 * Returns a new column of the same type as the receiver, such that each value in the new column
 * is the difference between the corresponding cell in the original and its predecessor.
 * The value for the first cell in the new column is the missing value indicator for that column
 * (e.g. IntColumn.MISSING_VALUE).
 */
IntColumn difference() { ... }

See the attached pdf for an example of the output.
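
The behaviour described in the template can be sketched on a plain int array. The sentinel below is a hypothetical stand-in for IntColumn.MISSING_VALUE, and propagating a missing input to a missing output is an assumption, not something the issue specifies.

```java
final class DifferenceSketch {
    // Hypothetical sentinel standing in for IntColumn.MISSING_VALUE.
    static final int MISSING_VALUE = Integer.MIN_VALUE;

    // Returns an array where element i is values[i] - values[i - 1].
    // The first element has no predecessor, so it is marked missing.
    static int[] difference(int[] values) {
        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            boolean noPredecessor = (i == 0);
            // Assumption: a missing operand yields a missing difference.
            if (noPredecessor
                    || values[i] == MISSING_VALUE
                    || values[i - 1] == MISSING_VALUE) {
                result[i] = MISSING_VALUE;
            } else {
                result[i] = values[i] - values[i - 1];
            }
        }
        return result;
    }
}
```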

Fix iteration on LocalDate columns and check other columns

LocalDateColumn implements IntIterable so you can iterate on the packed date values. For user friendliness, it should instead implement Iterable<LocalDate> and provide a getIntIterator method to get the packed version.

Change build process to produce a single shaded jar

Add other CSV-based constructors to Table with all options supported

Indexing takes too long

The current index implementation takes too long to perform an index operation on a large int column. Indexing times vary but are in the 7 to 8 minute range on a 500 million record column.

Enhance .saw format compression

The Saw data store uses Snappy for compression by default. It could possibly use type specific compression that was both smaller and faster (Roaring Bitmaps for Booleans, FastPFOR for Ints). Bitmaps could double as the in-memory representation, eliminating translation overhead.

Refactor the column utils

to make them easier to understand and use from a REPL

Evaluate use of specialized compression for in-memory data

It's possible that both boolean and integer columns could use specialized compression that allowed operations on the data in compressed format. Bools could use Roaring Bitmaps for example. Integers could use integer compression (library from groupon?). Ints are especially important as they're used for dates, times, and categorical data, as well as for IntColumns.

Evaluate the use of bitmaps and int-specific compression for indexing

to minimize memory requirements of indexes

Enable group formation from arbitrary (or nearly arbitrary) function results

Currently, groups are formed by unique combinations of column values. It should be possible to group on the result of functions called on column values - for example, directly grouping on a month value extracted from a date column, without having to create a month column and use that.
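
A sketch of what grouping on a function result could look like, using a plain list in place of a column. `groupRows` is a hypothetical helper; it maps each distinct function result to the row indexes that produced it, so no intermediate column is needed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class GroupByFunctionSketch {
    // Groups row indexes by the result of applying keyFunction to each cell.
    // Keys appear in order of first occurrence.
    static <T, K> Map<K, List<Integer>> groupRows(
            List<T> column, Function<? super T, ? extends K> keyFunction) {
        Map<K, List<Integer>> groups = new LinkedHashMap<>();
        for (int row = 0; row < column.size(); row++) {
            K key = keyFunction.apply(column.get(row));
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }
}
```

For the month example, a list of java.time.LocalDate values could be grouped with `groupRows(dates, LocalDate::getMonth)`.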

Feature: Streaming imports from CSV to .saw

Write an importer that simply streams data from CSV direct to .saw output, so that interactive use of the data can proceed quickly

remove unneeded row(int) method from table. it does nothing

public int row(int r) {
    return r;
}

Feature: Integration with charting library for EDA

Parallel CSV file imports

CSV imports are orders of magnitude slower than Saw file loads. Try to speed things up by opening multiple importers and loading separate columns with each. (Activity Monitor suggests that CSV imports are CPU-bound and that only one core is currently used.)

Add tests for all float functions

add support to fill columns

sequence data (including dates)
random data (by distribution)
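
Both kinds of fill can be sketched with the standard library alone. The method names and the choice of a daily step and a Gaussian distribution are assumptions for illustration; a real API would presumably write into column objects and support other steps and distributions.

```java
import java.time.LocalDate;
import java.util.Random;

final class FillSketch {
    // Sequence fill: consecutive calendar days starting at 'start'.
    static LocalDate[] dateSequence(LocalDate start, int count) {
        LocalDate[] result = new LocalDate[count];
        for (int i = 0; i < count; i++) {
            result[i] = start.plusDays(i);
        }
        return result;
    }

    // Random fill: normally distributed values. A fixed seed makes the
    // output reproducible, which is convenient in tests.
    static double[] gaussian(int count, double mean, double stdDev, long seed) {
        Random random = new Random(seed);
        double[] result = new double[count];
        for (int i = 0; i < count; i++) {
            result[i] = mean + stdDev * random.nextGaussian();
        }
        return result;
    }
}
```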

Add date parsing performance improvements to LocalTime and LocalDateTime columns

code is in LocalDate, TypeUtils.

In ObservationDataTest the .saw file is written to the wrong folder

The folder written to is the expected folder concatenated with the expected folder, thus:

expected:
/foo/bar/bam/table.saw
is written to
/foo/bar/bam/foo/bar/bam/table.saw

Add the ability to subdivide tables on column values, independent of the aggregation logic

Add Apache license text to every file

like so:

/*
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

Convert operations to use indexes when they are present

For an integer column, the following operations can all be implemented on top of indexes, rather than on the column itself, and so should be more efficient:

sum()
mean()
median() (and all other percentile operations)
max()/min()
equalTo(), greaterThan(), atLeast(), atMost(), lessThan()
between() and all variations on between();
standardDeviation()
variance()
histogram binning?
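
To make the idea concrete, here is a sketch that assumes the index is a sorted map from each distinct value to the rows holding it. That structure is an assumption; the project's actual index may be organised differently. With it, several of the operations above touch only the distinct keys rather than every row.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical index: distinct value -> row numbers containing it.
final class IntIndexSketch {
    private final TreeMap<Integer, List<Integer>> rowsByValue = new TreeMap<>();

    IntIndexSketch(int[] column) {
        for (int row = 0; row < column.length; row++) {
            rowsByValue.computeIfAbsent(column[row], k -> new ArrayList<>()).add(row);
        }
    }

    // Sum visits each distinct value once: value * number of occurrences.
    long sum() {
        long total = 0;
        for (Map.Entry<Integer, List<Integer>> e : rowsByValue.entrySet()) {
            total += (long) e.getKey() * e.getValue().size();
        }
        return total;
    }

    // min/max are the ends of the sorted key set
    // (both throw NoSuchElementException on an empty index).
    int min() { return rowsByValue.firstKey(); }
    int max() { return rowsByValue.lastKey(); }

    // Inclusive range query; returns matching row numbers in ascending order.
    // Requires low <= high.
    List<Integer> between(int low, int high) {
        List<Integer> rows = new ArrayList<>();
        for (List<Integer> matches : rowsByValue.subMap(low, true, high, true).values()) {
            rows.addAll(matches);
        }
        Collections.sort(rows);
        return rows;
    }
}
```

The benefit depends on cardinality: a column with few distinct values gains the most, while a column of nearly unique values gains little for sum() but still benefits for range queries.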

Feature: Integration with Machine Learning Library

either from Spark or Smile?

enable parallel query execution

using index results, you can just divide the keys, and reassemble the results in order.

may require a different approach for standard queries if we are to maintain result orders à la kdb.

It should be possible to select a column derived from a function in a query

for example, instead of just

Table newTable = table.select(String... columnNames).where(aFilter);

it should be possible to say something like:

table.select(dateColumn.month(), dateColumn.year(), sales).where(aFilter);

add correlation as a reduce function over two columns
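
The issue does not say which correlation measure is meant. Assuming Pearson's coefficient, a two-column reduce function could be sketched on plain double arrays as follows; the class and method names are hypothetical.

```java
final class CorrelationSketch {
    // Pearson correlation coefficient of two equally long columns.
    // Returns NaN if either column has zero variance.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("columns must have equal length >= 2");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Accumulate the co-deviation and the two squared deviations.
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }
}
```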

is there no way to copy columns or tables?

I can't see one. If there isn't, I think we should add a way of copying both columns and tables. Otherwise you add a column from one Table to another, modify that column, and screw up the original Table.

Clearly with the big-data type workflows, you want to avoid copying but Table will also be useful for many cases where people are dealing with much smaller things, for instance a Table of summary statistics that you want to save as csv. Calling copy in these cases won't be an issue.

The other approach is to make Column immutable and so you don't have to defensively copy. It may be too late for that here.
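
A sketch of the copy approach, using a tiny hypothetical column class rather than Tablesaw's own. The copy owns its own backing array, so changing it cannot affect the column it was copied from.

```java
import java.util.Arrays;

// Hypothetical minimal column, for illustration only.
final class IntColumnCopySketch {
    private final String name;
    private final int[] data;

    IntColumnCopySketch(String name, int[] data) {
        this.name = name;
        // Defensive copy: never share the caller's array.
        this.data = Arrays.copyOf(data, data.length);
    }

    // Returns an independent column with the same name and values.
    IntColumnCopySketch copy() {
        return new IntColumnCopySketch(name, data);
    }

    int get(int row) { return data[row]; }

    void set(int row, int value) { data[row] = value; }
}
```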

lots of functions in FloatColumn seem to be broken

here is an example:
`public FloatColumn round() {
    FloatColumn newColumn = create(this.name() + "[rounded]");

    for (int r = 0; r < this.size(); ++r) {
        float value = this.data.getFloat(r);
        newColumn.set(r, (float) Math.round(value));
    }

    return newColumn;
}`

the set fails because the length of newColumn is 0.
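
The same failure mode can be reproduced with a plain java.util.List standing in for FloatColumn: calling set on an empty list throws, whereas appending grows it. The sketch below shows the append-based variant; whether FloatColumn should be fixed this way or by pre-sizing the new column is a design choice the issue leaves open.

```java
import java.util.ArrayList;
import java.util.List;

final class RoundSketch {
    // Returns a new list with each value rounded to the nearest integer.
    static List<Float> round(List<Float> data) {
        // The argument is only an initial capacity; the list's size is still 0.
        List<Float> rounded = new ArrayList<>(data.size());
        for (float value : data) {
            // rounded.set(i, ...) would throw IndexOutOfBoundsException here,
            // so the value is appended instead.
            rounded.add((float) Math.round(value));
        }
        return rounded;
    }
}
```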

Feature: Autodetect CSV column type

so reading CSV doesn't require columns to be pre-defined, which is tedious for wide files

Consider making columns immutable

It would mitigate sharing issues and make several table ops simpler, including the use of TemporaryView.

Change the sort api to allow the use of "-columnName" to indicate a descending sort

Sorting currently defaults to ascending order:
t.sortOn("column1", "column2");
sorts in ascending order based on the values in the two given columns.

To specify a descending sort, you have to use:
t.sortDescendingOn("column1");

If you want to mix ascending and descending order, for example sorting by age starting with the youngest and then by height starting with the tallest, you have to construct a Sort object and pass it to a specialized sortOn method. It should be possible to simply write:
sortOn("age", "-height"); // by age starting with the youngest, then by height starting with the tallest.

This can be implemented by parsing the column names and constructing a Sort object behind the scenes
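
A sketch of the parsing step, with rows modelled as maps from column name to int value. `SortKey` is a hypothetical stand-in for the existing Sort object, whose real API is not shown here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class SortSpecSketch {
    // Hypothetical parsed form of one sort column.
    static final class SortKey {
        final String column;
        final boolean descending;

        SortKey(String column, boolean descending) {
            this.column = column;
            this.descending = descending;
        }
    }

    // A leading '-' marks a descending sort and is stripped from the name.
    static List<SortKey> parse(String... columnNames) {
        List<SortKey> keys = new ArrayList<>();
        for (String name : columnNames) {
            boolean descending = name.startsWith("-");
            keys.add(new SortKey(descending ? name.substring(1) : name, descending));
        }
        return keys;
    }

    // Chains one comparator per key, in the order given.
    static Comparator<Map<String, Integer>> comparator(String... columnNames) {
        Comparator<Map<String, Integer>> result = (a, b) -> 0;
        for (SortKey key : parse(columnNames)) {
            Comparator<Map<String, Integer>> next =
                    Comparator.comparing((Map<String, Integer> row) -> row.get(key.column));
            result = result.thenComparing(key.descending ? next.reversed() : next);
        }
        return result;
    }
}
```

One caveat of this convention is that a column whose name really begins with "-" can no longer be named directly, so an escape rule or the explicit Sort object would still be needed for that case.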
