loadData {GCDkit}    R Documentation

Loading data into GCDkit

Description

Loads data from a file (or, alternatively, the clipboard) into GCDkit. The files may contain plain text or, if the RODBC library has been installed, can be in the dBase III/IV (*.dbf), Excel (*.xls), Access (*.mdb), PetroGraph (*.peg), IgPet or NewPet (*.roc) formats.

Usage

loadData(filename = NULL, separators = c("\t", ",", ";", " "),
na.strings = c("NA", "-", "bd", "b.d.", "bdl", "b.d.l.", "N.A.", "n.d."),
clipboard = FALSE, merging = FALSE)

loadDataOdbc(filename = NULL, na.strings = c("NA", "-", "bd",
"b.d.", "bdl", "b.d.l.", "N.A.", "n.d."), merging = FALSE,
ODBC.choose = TRUE)

Arguments

filename

fully qualified name of the file to be loaded, including suffix.

separators

strings that should be tested as prospective delimiters separating individual items in the data file.

na.strings

strings that will be interpreted, together with empty items, zeros and negative numbers, as missing values (NA).

clipboard

logical; is clipboard to be read instead of a file?

merging

logical; is the function invoked during merging of two data files?

ODBC.choose

logical; if TRUE, ODBC channel can be chosen interactively.

Details

If library RODBC is available, the functions attempt to establish an ODBC connection to the selected file, and open it as dBase III/IV (*.dbf), Excel (*.xls) or Access (*.mdb) format. The DBF files are used to store data by other popular geochemical packages, such as IgPet (Carr, 1995) or MinPet (Richard, 1995).

Another format that can be imported is *.csv. It is employed by geochemical database systems such as GEOROC (http://georoc.mpch-mainz.gwdg.de/georoc/) and PETDB (http://www.petdb.org/).

The import filter for the *.csv files has been tailored to keep the structure of these databases in mind.

The package PetroGraph (Petrelli et al. 2005) saves data into *.peg files that are also, in principle, *.csv files compatible with the GCDkit.

Data files *.roc are yet another variant of *.csv files, used by NewPet (Clarke et al. 1994). They are not to be confused with the *.roc format designed for IgPet (Carr, 1995), which is a text file with a quite complex structure and whose import is still largely experimental. DBF files are to be preferred for importing IgPet data.

If the ODBC connection is not successful, the function 'loadData' assumes that it is dealing with a simple text file.

In contrast, 'loadDataOdbc' allows an ODBC channel to be specified interactively if 'ODBC.choose=TRUE'.

Plain text files can be delimited by tabs, commas or semicolons (the delimiter is recognized automatically). An alternative list of separators can be specified using the optional 'separators' parameter. The Windows clipboard is simply treated as a special kind of tab-delimited text file.
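
How the delimiter is recognized is not described here. One plausible approach, shown purely as an illustration, is to split the first few lines with each candidate and keep the first one that yields a consistent split into several fields. The helper 'guess_separator' is hypothetical and is not part of GCDkit; for simplicity it also assumes that all sampled lines have the same number of fields.

```r
# Hypothetical helper, not a GCDkit function: pick the first candidate
# separator that splits every sampled line into the same number (> 1) of fields.
guess_separator <- function(lines, separators = c("\t", ",", ";", " ")) {
  for (sep in separators) {
    n <- lengths(strsplit(lines, sep, fixed = TRUE))
    if (min(n) > 1 && length(unique(n)) == 1) return(sep)
  }
  NA_character_  # no candidate gave a consistent split
}

lines <- c("Sample;SiO2;MgO", "Sa-1;50.1;7.2", "Sa-2;61.3;2.1")
sep <- guess_separator(lines)
```

The candidates are tried in the order given, so the order of 'separators' acts as a priority list in this sketch.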

In the text file, the first line contains names for the data columns (except for the first one that is automatically assumed to contain the sample names); hence the first line may (or may not) have one item less than the following ones. The data rows start with the sample name and do not all have to be of the same length (the rest of a short row is filled with 'NA' automatically).
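
The layout described above matches a convention of base R's 'read.table': if the header has one field fewer than the data rows, the first field of each row is used as the row name. The following is a minimal sketch with made-up sample names and values; the import code of GCDkit itself may work differently.

```r
# Header has three names, data rows have up to four fields:
# read.table() then uses the first field of each row as the row (sample) name.
txt <- "SiO2\tMgO\tRb\nSa-1\t50.1\t7.2\t12\nSa-2\t61.3\t2.1\n"
dat <- read.table(text = txt, sep = "\t", header = TRUE, fill = TRUE)

rownames(dat)              # sample names taken from the first field
is.na(dat["Sa-2", "Rb"])   # the short row was padded with NA (fill = TRUE)
```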

Missing values ('NA') are allowed anywhere in the data file (naturally apart from sample and column names); any of 'NA', 'N.A.', '-', 'b.d.', 'bd', 'b.d.l.', 'bdl' or 'n.d.' are also treated as such, as specified by the parameter 'na.strings'.

While loading, the values '#WHATEVER!' (Excel error messages) are also replaced by 'NA' automatically.
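
The two replacements can be sketched in base R as follows. The input values are invented, and the regular expression used to spot Excel error messages (a leading '#' and a trailing '!') is an assumption rather than the pattern actually used by GCDkit.

```r
na.strings <- c("NA", "-", "bd", "b.d.", "bdl", "b.d.l.", "N.A.", "n.d.")
raw <- c("12.5", "b.d.", "#DIV/0!", "n.d.", "3")

raw[raw %in% na.strings] <- NA       # listed strings become NA
raw[grepl("^#.*!$", raw)] <- NA      # assumed pattern for Excel errors such as #DIV/0!
values <- as.numeric(raw)
```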

Please note that the function 'loadDataOdbc', due to the current limitations of the RODBC package, cannot handle correctly columns of mixed numeric and textual data. In such a column all textual information is converted to 'NA' and this unfortunately concerns the sample names as well. If encountering any problems, please use import from text file or via clipboard, which are much more robust.

Negative numbers and values of the form '< x' (used by some authors to indicate items below the detection limit) can be replaced either by half of the value (i.e. half of the detection limit) or by 'NA'. The user is prompted to choose between these options.

Alternatively, the negative values can be viewed either as missing ('NA') or can be imported, as may be desirable for instance for stable isotope data in the delta notation.
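
A minimal sketch of the detection-limit handling, assuming that a negative number -x and a string '< x' both stand for a detection limit of x. The helper 'replace_below_dl' and its argument 'half' are hypothetical; in GCDkit the choice is made interactively.

```r
# Hypothetical helper: 'x' holds the raw character entries of one data column.
replace_below_dl <- function(x, half = TRUE) {
  x <- trimws(x)
  lt <- grepl("^<", x)                                   # entries such as "< 5"
  num <- suppressWarnings(as.numeric(sub("^<\\s*", "", x)))
  below <- lt | (!is.na(num) & num < 0)                  # below detection limit
  num[below] <- if (half) abs(num[below]) / 2 else NA    # half of the limit, or NA
  num
}

replace_below_dl(c("12", "< 5", "-2", "0.7"))
replace_below_dl(c("12", "< 5", "-2", "0.7"), half = FALSE)
```

For data in which negative values are meaningful, such as stable isotope ratios in the delta notation, a replacement of this kind would have to be skipped.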

Decimal commas, if present in the text file, are converted to decimal points.
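
In base R such a conversion could look like the sketch below (invented values; this only makes sense when the comma is not also the field delimiter).

```r
raw <- c("3,14", "2.5", "0,7")
values <- as.numeric(sub(",", ".", raw, fixed = TRUE))  # "3,14" becomes 3.14
```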

The data files can be practically freeform, i.e. no specified oxides/elements are required and no exact order of these is to be adhered to. Analyses can contain as many numeric columns as necessary; the names of oxides and trace elements are self-explanatory (e.g. "SiO2", "Fe2O3", "Rb", "Nd").

In the text files (or if pasting from clipboard), any line starting with the hash symbol ('#') is ignored and can be used to introduce comments or to prevent the given analysis from loading temporarily.
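
One way to reproduce this behaviour in base R, shown as a sketch with invented data, is to drop the lines that start with '#' before parsing. Comment handling inside 'read.table' is then switched off, so that a '#' elsewhere in a line (as in the column name 'mg#') is left alone.

```r
lines <- c("SiO2\tMgO",
           "# analysed in 2004",
           "Sa-1\t50.1\t7.2",
           "#Sa-2\t61.3\t2.1",      # analysis temporarily excluded
           "Sa-3\t70.2\t0.4")

kept <- lines[!startsWith(lines, "#")]
dat <- read.table(text = kept, sep = "\t", header = TRUE, comment.char = "")
rownames(dat)
```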

Note that names of variables are case sensitive in R. However, any of the fully upper case names of the oxides/elements that appear in the following list are translated automatically to the appropriate capitalization:

SiO2, TiO2, Al2O3, Fe2O3, FeO, MnO, MgO, CaO, Na2O, FeOt, Fe2O3t,
 
Li2O, mg#, Ac, Ag, Al, As, At, Au, Ba, Be, Bi, 

Br, Ca, Cd, Ce, Cl, Co, Cr, Cs, Cu, Dy, Er, Eu, 

Fe, Ga, Gd, Ge, Hf, Hg, Ho, In, Ir, La, Li, Lu, 

Mg, Mn, Mo, Na, Nb, Nd, Ne, Ni, Np, Os, Pa, Pb, 

Pd, Pm, Pr, Pt, Pu, Rb, Re, Rh, Ru, S, Sb, Sc,

Se, Si, Sm, Sn, Sr, Ta, Tb, Te, Th, Ti, Tl, Tm, 

Yb, Zn, Zr.
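
The translation can be pictured as a lookup of each name in an upper-case copy of the dictionary. The sketch below uses only a small subset of the list above, and the helper 'fix_case' is hypothetical.

```r
# Illustrative subset of the dictionary above (not the full list).
dictionary <- c("SiO2", "TiO2", "Al2O3", "FeOt", "Rb", "Sr", "Nd")

# Hypothetical helper: only names that equal the fully upper-case form
# of a dictionary entry are translated; everything else is kept as it is.
fix_case <- function(nm) {
  hit <- match(nm, toupper(dictionary))
  ifelse(is.na(hit), nm, dictionary[hit])
}

fix_case(c("SIO2", "AL2O3", "RB", "Locality", "sr"))
```

Here 'sr' is left untouched because it is not fully upper case.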

Total iron, if given, should be expressed either as ferrous oxide ('FeOt', 'FeOT', 'FeOtot', 'FeOTOT' or 'FeO*') or ferric oxide ('Fe2O3t', 'Fe2O3T', 'Fe2O3tot', 'Fe2O3TOT' or 'Fe2O3*').

Structurally bound water can be named 'H2O.PLUS', 'H2O+', 'H2OPLUS', 'H2OP' or 'H2O_PLUS'.
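
The accepted spellings can be mapped onto a single name each, as in the following sketch. The helper 'normalize_names' is hypothetical, and the choice of 'FeOt', 'Fe2O3t' and 'H2O.PLUS' as target names is an assumption based on the names used elsewhere on this page.

```r
# Hypothetical helper mapping the accepted aliases onto one name each.
normalize_names <- function(nm) {
  nm[nm %in% c("FeOt", "FeOT", "FeOtot", "FeOTOT", "FeO*")] <- "FeOt"
  nm[nm %in% c("Fe2O3t", "Fe2O3T", "Fe2O3tot", "Fe2O3TOT", "Fe2O3*")] <- "Fe2O3t"
  nm[nm %in% c("H2O.PLUS", "H2O+", "H2OPLUS", "H2OP", "H2O_PLUS")] <- "H2O.PLUS"
  nm
}

normalize_names(c("SiO2", "FeO*", "Fe2O3TOT", "H2O+"))
```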

Upon loading, all the completely empty columns are removed first. Any non-numeric items found in a data column with one of the names listed in the above dictionary are assumed to be typos and replaced by 'NA', after a warning appears. At the next stage all fully numeric data columns are stored in a numeric data matrix 'WR'.

For any missing major- and minor-element data (SiO2, TiO2, Al2O3, Fe2O3, FeO, MnO, MgO, CaO, Na2O, K2O, H2O.PLUS, CO2, P2O5, F, S), an empty (NA) column is created automatically.

The remaining columns, i.e. all those that are at least partly textual, are transferred to the data frame 'labels'. Also attached to 'labels' are a column whose name starts with 'Symbol' (if any), which is taken to contain the plotting symbols, and a column named 'Colour' or 'Color' (if any; capitalization does not matter), which may contain the plotting colours. The relative size of the individual plotting symbols may be specified in a column named 'Size' or 'cex', which is likewise attached to 'labels'.
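
The split into 'WR' and 'labels', together with the padding of missing major- and minor-element columns, can be sketched in base R as follows. The input data frame is invented and the code only illustrates the idea; it is not the code used by GCDkit and it ignores the special 'Symbol', 'Colour' and 'Size' columns.

```r
dat <- data.frame(SiO2 = c(50.1, 61.3), MgO = c(7.2, 2.1),
                  Rock = c("gabbro", "granite"),
                  row.names = c("Sa-1", "Sa-2"), stringsAsFactors = FALSE)

is.num <- vapply(dat, is.numeric, logical(1))
WR <- as.matrix(dat[, is.num, drop = FALSE])   # fully numeric columns
labels <- dat[, !is.num, drop = FALSE]         # at least partly textual columns

# Add an empty (NA) column for every major/minor element that is absent.
majors <- c("SiO2", "TiO2", "Al2O3", "Fe2O3", "FeO", "MnO", "MgO", "CaO",
            "Na2O", "K2O", "H2O.PLUS", "CO2", "P2O5", "F", "S")
absent <- setdiff(majors, colnames(WR))
WR <- cbind(WR, matrix(NA_real_, nrow(WR), length(absent),
                       dimnames = list(NULL, absent)))
```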

The plotting symbols can be given either by their code (see showSymbols) or directly as strings of single characters.

The colours can be specified as codes (1-49) or English names (see showColours or type 'colours()' into the Console window).

If specifications of the plotting symbols and colours are missing completely, and at least one non-numeric variable is present, the user is asked whether the symbols and colours should be assigned automatically, from 1 to n, according to the levels of a selected label. Otherwise, default symbols (empty black circles) are used.
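
Assigning codes from 1 to n according to the levels of a label amounts to converting the label to a factor, as in this sketch with an invented 'Rock' column. The object name 'lab' is used only for illustration.

```r
rock <- c("granite", "gabbro", "granite", "diorite")
code <- as.integer(factor(rock))   # levels in sorted order: diorite, gabbro, granite
lab <- data.frame(Rock = rock, Symbol = code, Colour = code)
```

Samples sharing the same label thus receive the same symbol and colour.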

The default grouping is set on the basis of the plotting symbols ('labels$Symbol') or the data column used to autoassign the plotting symbols and colours.

Lastly, a backup copy of the data is stored in the list 'WRCube' using the function 'pokeDataset'. It is stored either under the name of the file, or, if it already exists, under the file name with a time stamp attached.
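
The naming rule can be sketched as below. The helper 'backup_name' is hypothetical and the format of the time stamp is an assumption; the actual storing is done by 'pokeDataset', which is not called here.

```r
WRCube <- list("sazava.data" = list())   # pretend one dataset is already stored

# Hypothetical helper: append a time stamp if the name is already taken.
backup_name <- function(name, existing) {
  if (name %in% existing) {
    paste(name, format(Sys.time(), "%Y-%m-%d %H:%M:%S"))
  } else {
    name
  }
}

backup_name("blatna.data", names(WRCube))   # unused name, returned unchanged
backup_name("sazava.data", names(WRCube))   # name in use, time stamp appended
```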

Value

WR

numeric matrix: all numeric data

labels

data frame: all at least partly character fields; labels$Symbol contains plotting symbols and labels$Colour the plotting colours

The function prints a short summary about the loaded file. It also loads and executes the Plugins, i.e. all the R code (*.r) that is currently stored in the subdirectory '\Plugin'. Finally, the system performs some recalculations (calling 'Gcdkit.r').

Note

In order to ensure the database functionality, duplicated column (variable) names are not allowed. To a large extent, this also applies to the sample names. The only exception is CSV files: if duplicated sample names are found, sequence numbers are assigned instead.
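
One reading of the CSV rule, shown here as a sketch only, is that all sample names are replaced by 1 to n as soon as any duplicate is detected; whether GCDkit renumbers all samples or only the duplicated ones is not stated here.

```r
samples <- c("Sa-1", "Sa-2", "Sa-1")
if (anyDuplicated(samples) > 0) {
  samples <- as.character(seq_along(samples))   # assumed: renumber all samples
}
```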

All completely empty rows and columns in both labels and numeric data are ignored.

Author(s)

The RODBC package was written by Brian Ripley.

Vojtech Janousek, vojtech.janousek@geology.cz

References

Carr M (1995) Program IgPet. Terra Softa, Somerset, New Jersey, U.S.A.

Clarke D, Mengel F, Coish RA, Kosinowski MHF (1994) NewPet for DOS, version 94.01.07. Department of Earth Sciences, Memorial University of Newfoundland, Canada.

Petrelli M, Poli G, Perugini D, Peccerillo A (2005) PetroGraph: A new software to visualize, model, and present geochemical data in igneous petrology. Geochemistry Geophysics Geosystems 6: 1-15.

Richard LR (1995) MinPet: Mineralogical and Petrological Data Processing System, Version 2.02. MinPet Geological Software, Quebec, Canada.

See Also

'saveData', 'mergeData', 'pokeDataset', 'showColours', 'showSymbols', 'read.table', 'getwd', 'setwd'

Examples

# Sets the working path and loads the 'sazava' test data set
setwd(paste(gcdx.dir,"Test_data",sep="/")) 
loadData("sazava.data")

[Package GCDkit version 4.1 Index]