public class Frame extends Lockable<Frame>
A collection of named Vecs, essentially an R-like Distributed Data Frame.
Frames represent a large distributed 2-D table with named columns
(Vecs) and numbered rows. A reasonable column limit is
100K columns, but there's no hard-coded limit. There's no real row
limit except memory; Frames (and Vecs) with many billions of rows are used
routinely.
A Frame is a collection of named Vecs; a Vec is a collection of numbered
Chunks. A Frame is small and cheaply and easily manipulated, so it is
commonly passed by value. It exists on one node, and may be
stored in the DKV. Vecs, on the other hand, must be stored in the
DKV, as they represent the shared common management state for a collection
of distributed Chunks.
Multiple Frames can reference the same Vecs, although this sharing can
make Vec lifetime management complex. Temporary Frames are commonly used
to work with a subset of some other Frame (often during algorithm
execution, when some columns are dropped from the modeling process). The
temporary Frame can simply be ignored, allowing the normal GC process to
reclaim it. Such temp Frames usually have a null key.
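The sharing rule matters when a temp Frame is cleaned up explicitly rather than left to the GC. The following is a minimal sketch, not H2O code: it models Vecs purely by hypothetical key strings and shows the selection rule that deleteTempFrameAndItsNonSharedVecs is described as applying, namely that only Vecs absent from the base Frame are deleted.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: each Vec is represented only by its key string.
final class TempFrameCleanupSketch {

    // Keys referenced by the temp frame but not by the base frame.
    // Shared keys are left alone because they still back the base frame.
    static Set<String> nonSharedVecKeys(List<String> tempVecKeys, List<String> baseVecKeys) {
        Set<String> result = new LinkedHashSet<>(tempVecKeys);
        result.removeAll(baseVecKeys);
        return result;
    }

    public static void main(String[] args) {
        List<String> base = List.of("vec_age", "vec_income", "vec_response");
        // The temp frame reuses two base Vecs and adds one derived Vec of its own.
        List<String> temp = List.of("vec_age", "vec_response", "vec_log_income");
        System.out.println(nonSharedVecKeys(temp, base)); // prints [vec_log_income]
    }
}
```

In this model, deleting `vec_age` along with the temp Frame would corrupt the base Frame, which is the lifetime hazard the paragraph above refers to.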
All the Vecs in a Frame belong to the same Vec.VectorGroup, which
then enforces Chunk row alignment across Vecs (or at least enforces
a low-cost access model). Parallel and distributed execution touching all
the data in a Frame relies on this alignment to get good performance.
Example: Make a Frame from a CSV file:
File file = ...
NFSFileVec nfs = NFSFileVec.make(file); // NFS-backed Vec, lazily read on demand
Frame fr = water.parser.ParseDataset.parse(Key.make("myKey"),nfs._key);
Example: Find and remove the Vec called "unique_id" from the Frame, since modeling with a unique_id can lead to overfitting:
Vec uid = fr.remove("unique_id");
Example: Move the response column to the last position:
fr.add("response",fr.remove("response"));
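The two column examples above rely on remove returning the removed column and add appending at the end. The sketch below models only that ordering behaviour with a plain insertion-ordered map. `ColumnOrderSketch` is a hypothetical stand-in, not the Frame class, and returning null for a missing name is an assumption of the model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Frame: named columns kept in insertion order.
final class ColumnOrderSketch {
    // Iteration order of the map is the column order.
    private final Map<String, double[]> cols = new LinkedHashMap<>();

    // Appends a named column as the last column and returns it.
    double[] add(String name, double[] col) {
        cols.put(name, col);
        return col;
    }

    // Removes the named column and returns it (null if absent; an assumption of this model).
    double[] remove(String name) {
        return cols.remove(name);
    }

    List<String> names() {
        return new ArrayList<>(cols.keySet());
    }

    public static void main(String[] args) {
        ColumnOrderSketch fr = new ColumnOrderSketch();
        fr.add("unique_id", new double[] {1, 2, 3});
        fr.add("response", new double[] {0, 1, 0});
        fr.add("age", new double[] {31, 45, 27});

        fr.remove("unique_id");                    // drop the identifier column
        fr.add("response", fr.remove("response")); // remove, then re-append at the end
        System.out.println(fr.names());            // prints [age, response]
    }
}
```

The argument `fr.remove("response")` is evaluated before `add` runs, so the column is already gone when it is appended again.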
Modifier and Type | Class and Description |
---|---|
static class | Frame.CSVStream |
static class | Frame.CSVStreamParams |
static class | Frame.DeepSelect: Last column is a bit vec indicating whether or not to take the row. |
class | Frame.FrameVecRegistry |
static class | Frame.VecSpecifier: Pair of (column name, Frame key). |
Modifier and Type | Field and Description |
---|---|
java.lang.String[] | _names: Vec names |
Constructor and Description |
---|
Frame(Frame fr): Deep copy of Vecs and Keys and Names (but not data!) to a new random Key. |
Frame(Key<Frame> key): Creates an empty frame with given key. |
Frame(Key<Frame> key, java.lang.String[] names, Vec[] vecs): Creates a frame with given key, names and vectors. |
Frame(Key<Frame> key, Vec[] vecs): Creates a frame with given key, default names and vectors. |
Frame(java.lang.String[] names, Vec[] vecs): Creates an internal frame composed of the given Vecs and names. |
Frame(Vec... vecs): Creates an internal frame composed of the given Vecs and default names. |
Modifier and Type | Method and Description |
---|---|
Frame | add(Frame fr): Append a Frame onto this Frame. |
void | add(java.lang.String[] names, Vec[] vecs) |
void | add(java.lang.String[] names, Vec[] vecs, int cols) |
Vec | add(java.lang.String name, Vec vec): Append a named Vec to the Frame. |
Vec | anyVec(): Returns the first readable vector. |
Vec[] | bulkRollups() |
long | byteSize(): The Vec.byteSize of all Vecs |
int[] | cardinality(): Number of categorical levels for categorical columns; -1 for non-categorical columns. |
protected long | checksum_impl(boolean noCache): 64-bit checksum of the checksums of the vecs. |
Frame | deepCopy(java.lang.String keyName): Create a copy of the input Frame and return that copied Frame. |
Frame | deepSlice(java.lang.Object orows, java.lang.Object ocols): In support of R, a generic Deep Copy and Slice. |
static java.lang.String | defaultColName(int col): Default column name maker |
Frame | delete_and_lock(Key<Job> job_key) |
static void | deleteTempFrameAndItsNonSharedVecs(Frame tempFrame, Frame baseFrame): Given a temp Frame and a base Frame from which it was created, delete the Vecs that aren't found in the base Frame and then delete the temp Frame. |
java.lang.String[][] | domains(): All the domains for categorical columns; null for non-categorical columns. |
static Job | export(Frame fr, java.lang.String path, java.lang.String frameName, boolean overwrite, int nParts) |
static Job | export(Frame fr, java.lang.String path, java.lang.String frameName, boolean overwrite, int nParts, boolean parallel, java.lang.String compression, Frame.CSVStreamParams csvParms) |
static Job | export(Frame fr, java.lang.String path, java.lang.String frameName, boolean overwrite, int nParts, java.lang.String compression, Frame.CSVStreamParams csvParms) |
static Job | exportParquet(Frame fr, java.lang.String path, boolean overwrite, java.lang.String compression, boolean writeChecksum, boolean tzAdjustFromLocal) |
Frame | extractFrame(int startIdx, int endIdx): Split this Frame; return a subframe created from the given column interval, and remove those columns from this Frame. |
static Frame[] | fetchAll(): Fetch all Frames from the KV store. |
int | find(Key key): Deprecated. Because many columns in a Frame could be backed by the same Vec (and its key), we can't return a single column index that corresponds to a given key. Please use find(String) instead. |
int | find(java.lang.String name): Finds the column index with a matching name, or -1 if missing |
int[] | find(java.lang.String[] names): Bulk find(String) API |
int | find(Vec vec): Deprecated. Because many columns in a Frame could be backed by the same Vec, we can't return a single column index that corresponds to a given vec. Please use find(String) instead. |
Frame.FrameVecRegistry | frameVecRegistry(): A structure for fast lookup in the set of frame's vectors. |
boolean | hasInfs() |
boolean | hasNAs() |
void | insertVec(int i, java.lang.String name, Vec vec) |
boolean | isCompatible(Frame fr): Frames are compatible if they have the same layout (number of rows and chunking) and the same vector group (chunk placement). |
Key<Vec>[] | keys(): The array of keys. |
java.lang.Iterable<Key<Vec>> | keysList() |
Vec | lastVec(): Convenience accessor for the last Vec |
java.lang.String | lastVecName(): Convenience accessor for the last Vec name |
Vec[] | makeCompatible(Frame f) |
Vec[] | makeCompatible(Frame f, boolean force): Return the array of Vectors if 'f' is compatible with 'this'; otherwise return a new array of Vectors compatible with 'this' holding a copy of 'f's data. |
java.lang.Class<KeyV3.FrameKeyV3> | makeSchema() |
Frame | makeSimilarlyDistributed(Frame f, Key<Frame> newKey): Make rows of a given frame distributed similarly to this frame. |
double[] | means(): All the column means. |
int[] | modes(): Majority class for categorical columns; -1 for non-categorical columns. |
void | moveFirst(int[] cols): Move the provided columns to be first, in-place. |
double[] | mults(): One over the standard deviation of each column. |
long | naCount() |
double | naFraction() |
java.lang.String | name(int i): A single column name. |
java.lang.String[] | names(): The array of column names. |
int | numCols(): Number of columns |
long | numRows(): Number of rows |
Futures | postWrite(Futures fs): Allow rollups for all written-into vecs; used by MRTask once writing is complete. |
Frame | prepend(java.lang.String name, Vec vec): Insert a named column as the first column |
protected Keyed | readAll_impl(AutoBuffer ab, Futures fs) |
Vec[] | reloadVecs(): Force a cache-flush and reload, assuming vec mappings were altered remotely, or that the _vecs array was shared and now needs to be a defensive copy. |
protected Futures | remove_impl(Futures fs, boolean cascade): Actually remove/delete all Vecs from memory, not just from the Frame. |
Vec | remove(int idx): Removes a numbered column. |
Vec[] | remove(int[] idxs): Removes a list of columns by index; the index list must be sorted |
Vec | remove(java.lang.String name): Removes the column with a matching name. |
Frame | remove(java.lang.String[] names) |
Vec[] | removeAll(): Remove all the vecs from frame. |
void | reOrder(int[] newOrder): Re-order the columns according to the new order specified in newOrder. |
Vec | replace(int col, Vec nv): Replace one column with another. |
void | restructure(java.lang.String[] names, Vec[] vecs): Restructure a Frame completely |
void | restructure(java.lang.String[] names, Vec[] vecs, int cols): Restructure a Frame completely, but only for a specified number of columns (counting up) |
Futures | retain(Futures futures, java.util.Set<Key> retainedKeys) |
void | setNames(java.lang.String[] columns) |
Frame | sort(int[] cols): Sort rows of a frame, using the set of columns as keys. |
Frame | sort(int[] cols, int[] ascending) |
Frame | subframe(int startIdx, int endIdx): Create a subframe from given interval of columns. |
Frame | subframe(java.lang.String[] names): Returns a subframe of this frame containing only vectors with desired names. |
void | swap(int lo, int hi): Swap two Vecs in-place; useful for sorting columns by some criteria |
Frame | toCategoricalCol(int columIdx): Returns the original frame with specific column converted to categorical |
Frame | toCategoricalCol(java.lang.String column): Returns the original frame with specific column converted to categorical |
java.io.InputStream | toCSV(Frame.CSVStreamParams parms): Convert this Frame to a CSV (in an InputStream), that optionally is compatible with R 3.1's recent change to read.csv()'s behavior. |
java.lang.String | toString() |
java.lang.String | toString(long off, int len) |
java.lang.String | toString(long off, int len, boolean rollups) |
TwoDimTable | toTwoDimTable() |
TwoDimTable | toTwoDimTable(long off, int len) |
TwoDimTable | toTwoDimTable(long off, int len, boolean rollups) |
byte[] | types(): Type for every Vec |
java.lang.String[] | typesStr(): String name for each Vec type |
java.lang.String | uniquify(java.lang.String name) |
Vec | vec(int idx): Returns the Vec by given index, implemented by code: vecs()[idx]. |
Vec | vec(java.lang.String name): Return a Vec by name, or null if missing |
Vec[] | vecs(): The internal array of Vecs. |
Vec[] | vecs(int[] idxs) |
Vec[] | vecs(java.lang.String[] names) |
protected AutoBuffer | writeAll_impl(AutoBuffer ab): Write out K/V pairs, in this case Vecs. |
Methods inherited from class Lockable: delete_and_lock, delete_and_lock, delete_and_lock, delete, delete, delete, delete, read_lock, read_lock, read_lock, unlock_all, unlock, unlock, unlock, unlock, update, update, update, write_lock_to_read_lock, write_lock, write_lock, write_lock
Methods inherited from class Keyed: checksum_impl, checksum, checksum, getKey, readAll, remove_impl, remove_self_key_impl, remove, remove, remove, remove, remove, remove, removeQuietly, writeAll
Methods inherited from class Iced: asBytes, clone, copyOver, frozenType, read, readExternal, readJSON, reloadFromBytes, toJsonBytes, toJsonString, write, writeExternal, writeJSON
public Frame(Vec... vecs)
public Frame(java.lang.String[] names, Vec[] vecs)
public Frame(Key<Frame> key, Vec[] vecs)
public Frame(Key<Frame> key, java.lang.String[] names, Vec[] vecs)
public Frame(Frame fr)
public static void deleteTempFrameAndItsNonSharedVecs(Frame tempFrame, Frame baseFrame)
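Given a temp Frame and a base Frame from which it was created, delete the Vecs that aren't found in the base Frame and then delete the temp Frame. Consider using Scope.protect(Frame...) or a Scope.safe(Frame...) instead.
public static Frame[] fetchAll()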
public boolean hasNAs()
public boolean hasInfs()
public long naCount()
public double naFraction()
public final void setNames(java.lang.String[] columns)
public static java.lang.String defaultColName(int col)
public java.lang.String uniquify(java.lang.String name)
public boolean isCompatible(Frame fr)
public int numCols()
public long numRows()
public final Vec anyVec()
public java.lang.String[] names()
public java.lang.String name(int i)
public Key<Vec>[] keys()
public final Vec[] vecs()
The internal array of Vecs; the Vecs themselves are loaded from the DKV.
public final Vec[] vecs(int[] idxs)
public Vec[] vecs(java.lang.String[] names)
public Vec lastVec()
public java.lang.String lastVecName()
public final Vec[] reloadVecs()
public final Vec vec(int idx)
Returns the Vec by given index, implemented by code: vecs()[idx].
Parameters:
idx - idx of column
public Vec vec(java.lang.String name)
public int find(java.lang.String name)
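As an illustration of the name lookup contract (a column index, or -1 if missing), here is a minimal sketch over a plain array of names. It is not the Frame implementation. The linear scan and the per-name -1 in the bulk variant are assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical helper mirroring the documented find(String) contract.
final class FindSketch {

    // Linear scan over column names; -1 signals "no such column".
    static int find(String[] names, String name) {
        for (int i = 0; i < names.length; i++) {
            if (names[i].equals(name)) return i;
        }
        return -1;
    }

    // Bulk variant: one result per requested name, in request order.
    static int[] find(String[] names, String[] wanted) {
        int[] res = new int[wanted.length];
        for (int i = 0; i < wanted.length; i++) res[i] = find(names, wanted[i]);
        return res;
    }

    public static void main(String[] args) {
        String[] names = {"age", "income", "response"};
        System.out.println(find(names, "income"));  // prints 1
        System.out.println(find(names, "zipcode")); // prints -1
        System.out.println(Arrays.toString(
            find(names, new String[] {"response", "zipcode"}))); // prints [2, -1]
    }
}
```

Callers should check for -1 before using the result as an array index.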
@Deprecated public int find(Vec vec)
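Deprecated. Because many columns in a Frame could be backed by the same Vec, we can't return a single column index that corresponds to a given vec. Please use find(String) instead.
@Deprecated public int find(Key key)
Deprecated. Because many columns in a Frame could be backed by the same Vec (and its key), we can't return a single column index that corresponds to a given key. Please use find(String) instead.
public int[] find(java.lang.String[] names)
Bulk find(String) API
Returns:
an array of column indices matching the names array
public void insertVec(int i, java.lang.String name, Vec vec)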
public byte[] types()
public java.lang.String[] typesStr()
public java.lang.String[][] domains()
public int[] cardinality()
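The documented conventions for domains() (null for non-categorical columns) and cardinality() (-1 for non-categorical columns) line up naturally. The sketch below derives one from the other on plain arrays. It illustrates the convention only and makes no claim about how Frame computes these values.

```java
import java.util.Arrays;

// Hypothetical helper: per-column level counts derived from per-column domains.
final class CardinalitySketch {

    // domains[i] is the level array of a categorical column, or null otherwise.
    static int[] cardinality(String[][] domains) {
        int[] card = new int[domains.length];
        for (int i = 0; i < domains.length; i++) {
            card[i] = domains[i] == null ? -1 : domains[i].length;
        }
        return card;
    }

    public static void main(String[] args) {
        String[][] domains = {
            null,                     // numeric column
            {"red", "green", "blue"}, // categorical column with 3 levels
            {"no", "yes"}             // categorical column with 2 levels
        };
        System.out.println(Arrays.toString(cardinality(domains))); // prints [-1, 3, 2]
    }
}
```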
public Vec[] bulkRollups()
public int[] modes()
public double[] means()
public double[] mults()
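A typical use for a per-column mean together with "one over the standard deviation" is standardization, where the reciprocal lets each element be scaled with a multiply rather than a divide. The sketch below works on a single plain array. Using the sample (n - 1) standard deviation is an assumption, since this page does not say which form mults() uses.

```java
import java.util.Arrays;

// Hypothetical single-column version of the means()/mults() pairing.
final class StandardizeSketch {

    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // One over the standard deviation; the (n - 1) denominator is an assumption.
    static double mult(double[] x) {
        double m = mean(x);
        double ss = 0;
        for (double v : x) ss += (v - m) * (v - m);
        return 1.0 / Math.sqrt(ss / (x.length - 1));
    }

    // Centre each value, then scale by the precomputed reciprocal.
    static double[] standardize(double[] x) {
        double m = mean(x);
        double k = mult(x);
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - m) * k;
        return out;
    }

    public static void main(String[] args) {
        double[] col = {2, 4, 6}; // mean 4, sample standard deviation 2
        System.out.println(Arrays.toString(standardize(col))); // prints [-1.0, 0.0, 1.0]
    }
}
```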
public long byteSize()
The Vec.byteSize of all Vecs
Returns:
the Vec.byteSize of all Vecs
protected long checksum_impl(boolean noCache)
checksum_impl in class Keyed<Frame>
public void add(java.lang.String[] names, Vec[] vecs)
public void add(java.lang.String[] names, Vec[] vecs, int cols)
public Vec add(java.lang.String name, Vec vec)
public Frame add(Frame fr)
public Frame prepend(java.lang.String name, Vec vec)
public void swap(int lo, int hi)
public void reOrder(int[] newOrder)
Re-order the columns according to the new order specified in newOrder.
Parameters:
newOrder
public void moveFirst(int[] cols)
public Frame subframe(java.lang.String[] names)
Parameters:
names - list of vector names
Throws:
java.lang.IllegalArgumentException - if there is no vector with desired name in this frame.
public Futures postWrite(Futures fs)
Allow rollups for all written-into vecs; used by MRTask once writing is complete.
protected Futures remove_impl(Futures fs, boolean cascade)
remove_impl in class Keyed<Frame>
public Frame delete_and_lock(Key<Job> job_key)
delete_and_lock in class Lockable<Frame>
public final Futures retain(Futures futures, java.util.Set<Key> retainedKeys)
Removes this Frame object and all directly linked Keyed objects and POJOs, while retaining
the keys defined by the retainedKeys parameter. Aimed to be used for removal of Frame objects pointing
to shared resources (Vectors, Chunks etc.) internally.
WARNING: UNSTABLE API, might be removed/replaced at any time.
protected AutoBuffer writeAll_impl(AutoBuffer ab)
writeAll_impl in class Keyed<Frame>
protected Keyed readAll_impl(AutoBuffer ab, Futures fs)
readAll_impl in class Keyed<Frame>
public Vec replace(int col, Vec nv)
public Frame subframe(int startIdx, int endIdx)
Parameters:
startIdx - index of first column (inclusive)
endIdx - index of the last column (exclusive)
public Frame extractFrame(int startIdx, int endIdx)
Parameters:
startIdx - index of first column (inclusive)
endIdx - index of the last column (exclusive)
public Vec remove(java.lang.String name)
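Both subframe(int, int) and extractFrame(int, int) take a half-open column interval: startIdx is included and endIdx is not. The difference described on this page is that extractFrame also removes those columns from the source. A minimal sketch on a plain list of column names, with hypothetical helper names, makes the contrast concrete.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two interval operations on a list of column names.
final class ColumnIntervalSketch {

    // Copy of the columns in [startIdx, endIdx); the source list is untouched.
    static List<String> subframe(List<String> cols, int startIdx, int endIdx) {
        return new ArrayList<>(cols.subList(startIdx, endIdx));
    }

    // Same interval, but the columns are also removed from the source list.
    static List<String> extractFrame(List<String> cols, int startIdx, int endIdx) {
        List<String> view = cols.subList(startIdx, endIdx);
        List<String> extracted = new ArrayList<>(view);
        view.clear(); // clearing the view removes the range from cols
        return extracted;
    }

    public static void main(String[] args) {
        List<String> cols = new ArrayList<>(List.of("a", "b", "c", "d", "e"));
        System.out.println(subframe(cols, 1, 3));     // prints [b, c]
        System.out.println(cols);                     // prints [a, b, c, d, e]
        System.out.println(extractFrame(cols, 1, 3)); // prints [b, c]
        System.out.println(cols);                     // prints [a, d, e]
    }
}
```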
public Frame remove(java.lang.String[] names)
public Vec[] remove(int[] idxs)
public final Vec remove(int idx)
public Vec[] removeAll()
public void restructure(java.lang.String[] names, Vec[] vecs)
public void restructure(java.lang.String[] names, Vec[] vecs, int cols)
public Frame deepSlice(java.lang.Object orows, java.lang.Object ocols)
Semantics are a little odd, to match R's. Each dimension spec can be:
The numbering is 1-based; zeros are not allowed in the lists, nor are out-of-range values.
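To make the 1-based rule concrete, the sketch below converts a list of 1-based positive selectors to 0-based array indices and rejects zeros and out-of-range values. It covers only the rule stated here; other selector forms are outside the sketch. Throwing IllegalArgumentException is a choice made for the example, not documented deepSlice behaviour.

```java
import java.util.Arrays;

// Hypothetical validator for 1-based positive selectors.
final class OneBasedIndexSketch {

    // Converts 1-based selectors to 0-based indices for a dimension of the given size.
    static int[] toZeroBased(int[] oneBased, int size) {
        int[] out = new int[oneBased.length];
        for (int i = 0; i < oneBased.length; i++) {
            int v = oneBased[i];
            if (v == 0) {
                throw new IllegalArgumentException("zero is not a valid 1-based index");
            }
            if (v < 1 || v > size) {
                throw new IllegalArgumentException("index out of range: " + v);
            }
            out[i] = v - 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toZeroBased(new int[] {1, 3, 5}, 5))); // prints [0, 2, 4]
    }
}
```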
public java.lang.String toString()
toString in class java.lang.Object
public java.lang.String toString(long off, int len)
public java.lang.String toString(long off, int len, boolean rollups)
public TwoDimTable toTwoDimTable()
public TwoDimTable toTwoDimTable(long off, int len)
public TwoDimTable toTwoDimTable(long off, int len, boolean rollups)
public Frame deepCopy(java.lang.String keyName)
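Parameters:
keyName - Key for resulting frame. If null, no key will be given.
public Vec[] makeCompatible(Frame f, boolean force)
Return the array of Vectors if 'f' is compatible with 'this'; otherwise return a new array of Vectors compatible with 'this' holding a copy of 'f's data.
public Frame makeSimilarlyDistributed(Frame f, Key<Frame> newKey)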
Parameters:
f - frame that we want to re-distribute
newKey - key for a newly created frame
public static Job export(Frame fr, java.lang.String path, java.lang.String frameName, boolean overwrite, int nParts)
public static Job export(Frame fr, java.lang.String path, java.lang.String frameName, boolean overwrite, int nParts, java.lang.String compression, Frame.CSVStreamParams csvParms)
public static Job export(Frame fr, java.lang.String path, java.lang.String frameName, boolean overwrite, int nParts, boolean parallel, java.lang.String compression, Frame.CSVStreamParams csvParms)
public static Job exportParquet(Frame fr, java.lang.String path, boolean overwrite, java.lang.String compression, boolean writeChecksum, boolean tzAdjustFromLocal)
public java.io.InputStream toCSV(Frame.CSVStreamParams parms)
Convert this Frame to a CSV (in an InputStream), that optionally
is compatible with R 3.1's recent change to read.csv()'s behavior.
WARNING: Note that the end of a file is denoted by the read function
returning 0 instead of -1.
public java.lang.Class<KeyV3.FrameKeyV3> makeSchema()
makeSchema in class Keyed<Frame>
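The toCSV warning about end of data being signalled by 0 rather than -1 matters for read loops: a loop written as `!= -1` would never terminate on such a stream. The sketch below shows a drain loop that stops on any non-positive count, so it works for both conventions. `ZeroAtEofStream` is a hypothetical stand-in for the returned stream, and the assumption is that the bulk `read(byte[], int, int)` overload is the "read function" the warning refers to.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class CsvDrainSketch {

    // Reads until the stream reports end of data as either 0 or -1.
    static byte[] drain(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8];
        try {
            int n;
            while ((n = in.read(buf, 0, buf.length)) > 0) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Hypothetical stand-in: bulk reads return 0 once the data is exhausted.
    static final class ZeroAtEofStream extends InputStream {
        private final byte[] data;
        private int pos;

        ZeroAtEofStream(byte[] data) { this.data = data; }

        @Override public int read() {
            // A single-byte read cannot use 0 as a marker, so this model returns -1 here.
            return pos < data.length ? (data[pos++] & 0xff) : -1;
        }

        @Override public int read(byte[] b, int off, int len) {
            int n = Math.min(len, data.length - pos);
            System.arraycopy(data, pos, b, off, n);
            pos += n;
            return n;
        }
    }

    public static void main(String[] args) {
        byte[] csv = "a,b\n1,2\n".getBytes(StandardCharsets.US_ASCII);
        byte[] copy = drain(new ZeroAtEofStream(csv));
        System.out.print(new String(copy, StandardCharsets.US_ASCII)); // prints the two CSV lines
    }
}
```

Because a standard InputStream only returns 0 when asked for zero bytes, the `> 0` condition is also safe for ordinary streams given a non-empty buffer.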
public Frame sort(int[] cols)
public Frame sort(int[] cols, int[] ascending)
public Frame.FrameVecRegistry frameVecRegistry()
A structure for fast lookup in the set of frame's vectors.
Returns:
an instance of Frame.FrameVecRegistry
public Frame toCategoricalCol(int columIdx)
public Frame toCategoricalCol(java.lang.String column)