# Reading and Writing Data in R

The following are some functions for reading (importing) data into R.

• source() for reading in R code files (inverse of dump)
• dget() for reading in R code files (inverse of dput)

## Writing Data to files

The following are a few functions for writing (exporting) data to files.

• write.table() and write.csv() export data to delimited text files, including CSV and tab-delimited formats.
• writeLines() writes text lines to a text-mode connection.
• dump() takes a vector of names of R objects and produces text representations of the objects on a file (or connection). A dump file can usually be sourced into another R session.
• dput() writes an ASCII text representation of an R object to a file (or connection), or uses one to recreate the object.
• save() writes an external representation of R objects to the specified file.
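
Each writing function above pairs with a reading function. The following is a minimal sketch of a round trip through each pair; the data frame `df` and the temporary file names are illustrative assumptions, not part of the original text.

```r
# A small object to export and re-import (hypothetical example data).
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# dput() / dget(): text representation of a single object.
f_dput <- tempfile(fileext = ".R")
dput(df, file = f_dput)
df_dget <- dget(f_dput)            # returns the object; assign it to any name

# dump() / source(): dump() takes object *names* and stores the assignment.
f_dump <- tempfile(fileext = ".R")
dump("df", file = f_dump)
rm(df)
source(f_dump, local = TRUE)       # re-creates `df` in the calling environment

# save() / load(): external (binary) representation.
f_save <- tempfile(fileext = ".RData")
save(df, file = f_save)
rm(df)
load(f_save)                       # restores `df` under its original name

all.equal(df, df_dget)
```

Note the difference in naming: dget() returns a value that can be assigned to any name, while source() and load() restore objects under the names they had when they were written.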

## Reading data files with read.table()

The read.table() function is one of the most commonly used functions for reading data into R. It has a few important arguments.

• file, the name of a file, or a connection
• sep, a string indicating how the columns are separated
• colClasses, a character vector indicating the class of each column in the data set
• nrows, the number of rows in the dataset
• comment.char, a character string indicating the comment character
• skip, the number of lines to skip from the beginning
• stringsAsFactors, should character variables be coded as factors?

When these arguments are left at their defaults, R automatically skips lines that begin with a #, figures out how many rows there are (and how much memory needs to be allocated), and determines what type of variable is in each column of the table.
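
As a hedged illustration of the arguments listed above, the sketch below writes a small comma-separated file to a temporary location and reads it back. The file contents and column names are invented for the example; `header = TRUE` is an additional read.table() argument that is not in the list above.

```r
# Create a small text file to read (contents are made up for illustration).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "# a comment line, skipped because comment.char = \"#\"",
  "id,name,score",
  "1,anna,9.5",
  "2,ben,NA",
  "3,cara,7.25"
), tmp)

dat <- read.table(
  file = tmp,
  header = TRUE,                 # first non-comment line holds column names
  sep = ",",
  colClasses = c("integer", "character", "numeric"),
  nrows = 3,
  comment.char = "#",
  stringsAsFactors = FALSE
)

str(dat)
```

Supplying colClasses and nrows is optional, but it spares R the work of guessing column types and row counts, which can matter for large files.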

## Writing data files with write.table()

The following are a few important arguments commonly used in the write.table() function.

• x, the object to be written, typically a data frame
• file, the name of the file which the data are to be written to
• sep, the field separator string
• col.names, a logical value indicating whether the column names of x are to be written along with x, or a character vector of column names to be written
• row.names, a logical value indicating whether the row names of x are to be written along with x, or a character vector of row names to be written
• na, the string to use for missing values in the data

## write.table() and write.csv() Examples

x <- data.frame(a = 5, b = 10, c = pi)
write.table(x, file = "data.csv", sep = ",")   # comma-separated
write.table(x, "c:\\mydata.txt", sep = "\t")   # tab-delimited
write.csv(x, file = "data.csv")                # same as sep = "," with CSV defaults
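
To see how row.names and na affect the file that is written, the following sketch writes a data frame with a missing value to a temporary file and reads it back. The marker string "missing" and the temporary path are assumptions chosen for the example.

```r
# Data frame with one missing value (illustrative data).
x <- data.frame(a = c(5, NA), b = c(10, 20), c = c(pi, exp(1)))

out <- tempfile(fileext = ".csv")
write.table(x, file = out, sep = ",",
            row.names = FALSE,     # do not write the row names column
            col.names = TRUE,
            na = "missing")        # string written in place of NA

readLines(out)                     # inspect the raw text that was written

# Read the file back, telling R which string marks a missing value.
y <- read.csv(out, na.strings = "missing")
y
```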

## List in R Language

In the R language, a list is an object that consists of an ordered collection of objects known as its components. A list is a structured data object that can hold any number of other structured data objects of any mode (type). That is, one can put any kind of object (such as a vector, data frame, character object, matrix and/or array) into one list object. An example of a list is

> x <- list(c(1,2,3,5), c("a", "b", "c", "d"), c(T, T, F, T, F), matrix(1:9, nrow = 3))

which contains 4 components: three of them are vectors (numeric, character and logical) and one of them is a matrix.

An object can also be converted to a list by using the as.list() function. For a vector, the disadvantage is that each element of the vector becomes a separate component of the list. For example,

> as.list(1:10)
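
A short sketch of the difference between as.list() and list() for a vector; the variable name `v` is only for illustration.

```r
v <- 1:10

length(as.list(v))   # 10: every element becomes its own component
length(list(v))      # 1: the whole vector is kept as a single component
```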

## Extract components from a list

The operator [[ ]] (double square bracket) is used to extract the components of a list. To extract the second component of the list x, one can write at the R prompt,

> x[[2]]

Using the [ ] operator returns a list rather than the structured data (the component of the list). The components of a list need not be of the same mode. The components are always numbered. If x1 is the name of a list with four components, then the individual components may be referred to as x1[[1]], x1[[2]], x1[[3]], and x1[[4]].
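
The difference between the two operators can be checked directly. In this minimal sketch, `lst` is a hypothetical two-component list used only for the demonstration.

```r
lst <- list(c(1, 2, 3, 5), c("a", "b", "c", "d"))

class(lst[2])     # "list": single brackets return a sub-list
class(lst[[2]])   # "character": double brackets return the component itself
lst[[2]][3]       # "c": index into the extracted component
```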

If the components of a list are named, then these components can be extracted by using their names. For example, a list with named components is

> x1 <- list(a = c(1,2,3,5), b = c("a", "b", "c", "d"), c = c(T, T, F, T, F), d = matrix(1:9, nrow = 3))

To extract the component a, one can write

> x1$a        # the component itself
> x1["a"]     # a list of length one containing the component
> x1[["a"]]   # the component itself

To extract more than one component, one can write

> x[c(1,2)]    # extract components one and two
> x[-1]        # extract all components except the 1st
> x[[c(1,2)]]  # extract the 2nd element of component one
> x[[c(2,2)]]  # extract the 2nd element of component two
> x[2:4]       # extract components 2 to 4

## R workspace, object and .RData file

The structure of an R program is similar to that of programs written in other computer languages such as C or its successors C++ and Java. However, important differences between these languages and R are (i) R has no header files, (ii) most of the declarations are implicit, (iii) there are no pointers in R, and (iv) text and strings can be defined and manipulated directly as vectors.

R is a functional language. Most of the computation in R is handled using functions. The R language environment is designed to facilitate the development of new scientific computation tools.

Everything (such as functions and data structures) in R is an object. To see the names of all objects in the R workspace, type at the R command prompt

> ls()

objects() is an alternative to the ls() function. Similarly, typing the name of any object at the R prompt displays (prints) the content of that object. As an example, type q, mean, or lm at the R prompt.

It is possible to save an individual object or a collection of objects into a named image file. The named image file has the extension .RData. Some possibilities for saving objects from the R workspace are:

To save the content of the R workspace into the file .RData, type

> save.image()

To save the content of the R workspace in the file archive.RData, type

> save.image(file = "archive.RData")

To save only some required objects in data.RData, type

> save(x, y, file = "data.RData")

These image files can be attached to make the objects available in the next R session. For example,

> attach("archive.RData")

Note that when quitting, R offers the option of saving the workspace image. If the option is accepted, the workspace is saved in an image file (.RData) in the working directory.
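
The save and restore steps can be tried without touching the working directory. This is a minimal sketch: the objects `x` and `y`, the temporary path, and the search-path label "saved_data" are assumptions made for the example.

```r
x <- 1:5
y <- letters[1:3]

img <- file.path(tempdir(), "data.RData")
save(x, y, file = img)          # save selected objects

rm(x, y)
load(img)                       # restores x and y into the current workspace

e <- new.env()
load(img, envir = e)            # or load into a separate environment
ls(e)

# attach() puts the saved objects on the search path instead of copying them
# into the workspace; detach() removes them again.
attach(img, name = "saved_data", warn.conflicts = FALSE)
"saved_data" %in% search()
detach("saved_data")
```

Loading into a separate environment avoids silently overwriting objects that already exist under the same names.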
The image file can be used in the next R session. Saving the workspace image saves everything in the current workspace. Therefore, use the rm() function to remove objects that are not required in the next R session.

For further details about saving and loading R workspace visit: http://rfaqs.com/saving-and-loading-r-workspace

## mctest: An R package for Detection of Collinearity among Regressors

The problem of multicollinearity plagues the numerical stability of regression estimates. It also causes some serious problems in the validation and interpretation of the regression model. Consider the usual multiple linear regression model, $y = X \beta+u$, where $y$ is an $n\times 1$ vector of observations on the dependent variable, $X$ is a known design matrix of order $n\times p$ having full column rank $p$, $\beta$ is a $p \times 1$ vector of unknown parameters, and $u$ is an $n\times 1$ vector of random errors with mean zero and variance $\sigma^2 I_n$, where $I_n$ is an identity matrix of order $n$.

The existence of a linear dependence (relationship) between regressors can affect the ability of the regression model to estimate the model's parameters (regression coefficients). Therefore, multicollinearity is a lack of independence, or the presence of interdependence, signified by usually high inter-correlations within a set of regressors (predictors).

In case of severe multicollinearity (ill-conditioning) of the $X'X$ matrix, some of the possible issues are implausible signs, low t-ratios, high R-squared values, inflated standard errors, wider confidence intervals, a very large condition number (CN), and non-significant regression coefficient estimates and/or estimates of implausible magnitude.

Many diagnostic methods are available to check the existence of collinearity among regressors, such as the variance inflation factor (VIF), values of pair-wise correlation among regressors, eigenvalues, CN, Farrar and Glauber tests, Theil's measure, Klein's rule, etc.

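
Several of the diagnostics named above can be computed by hand in base R, which helps to see what they measure. The sketch below is not the mctest implementation; it uses four columns of the built-in mtcars data as an arbitrary stand-in for a set of regressors, and it uses one common definition of the condition number (square root of the ratio of the largest to the smallest eigenvalue of the correlation matrix).

```r
# Regressors: an arbitrary illustrative choice from the built-in mtcars data.
X <- as.matrix(mtcars[, c("disp", "hp", "wt", "qsec")])
R <- cor(X)                          # pair-wise correlations among regressors

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal element of solve(R).
vif <- diag(solve(R))
tol <- 1 / vif                       # tolerance (TOL)

# Check the first VIF against its definition via an auxiliary regression.
r2_disp <- summary(lm(disp ~ hp + wt + qsec, data = mtcars))$r.squared
vif_disp_by_definition <- 1 / (1 - r2_disp)

# Eigenvalues, condition number (assumed definition, see lead-in), determinant.
ev <- eigen(R, symmetric = TRUE)$values
cn <- sqrt(max(ev) / min(ev))
det_R <- det(R)                      # close to 0 signals strong collinearity

round(vif, 2)
```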
Our recently developed R package mctest computes several diagnostic measures to test the existence of collinearity among regressors. We classified these measures as individual collinearity diagnostics and overall collinearity diagnostics. Overall collinearity diagnostics include the determinant of the $X'X$ matrix, red indicator, Farrar Chi-Square test, Theil indicator, CN, and the sum of lambda inverse values. Individual collinearity diagnostics include VIF/TOL, Farrar and Glauber Wi test, the relationship between $R^2$ and F-test, corrected VIF (CVIF) and Klein's rule.

## How to use mctest package

You must have installed and loaded the mctest package to start testing for collinearity among regressors. As an example, we used the Hald data, which is already bundled in the mctest package.

The mctest package has 4 functions, namely mctest(), omcdiag(), imcdiag() and mc.plot(). The mctest() function can be used to obtain overall and/or individual collinearity diagnostics. The mc.plot() function can be used to draw a graph of VIF and eigenvalues for a graphical judgement of collinearity among regressors.

### mctest illustrative Example

The usage of the mctest() function is

mctest(x, y, type = c("o", "i", "b"), na.rm = TRUE, Inter = TRUE, method = NULL, corr = FALSE, detr = 0.01, red = 0.5, theil = 0.5, cn = 30, vif = 10, tol = 0.1, conf = 0.95, cvif = 10, leamer = 0.1, all = all)

For details of each argument, see the mctest package documentation. The following are a few commands that can be used to get different collinearity diagnostics.

> x <- Hald[ ,-1]            # X variables from Hald data
> y <- Hald[ ,1]             # y variable from Hald data
> mctest(x, y)               # default collinearity diagnostics
> mctest(x, y, type = "i")   # individual collinearity diagnostics
> mctest(x, y, type = "o")   # overall collinearity diagnostics

## Overall collinearity diagnostics

For overall collinearity diagnostics, eigenvalues and condition numbers are also produced, whether or not the intercept term is included.

The syntax of the omcdiag() function is

omcdiag(x, y, na.rm = TRUE, Inter = TRUE, detr = 0.01, red = 0.5, conf = 0.95, theil = 0.5, cn = 30, ...)

It reports the determinant of the correlation matrix, the Farrar test of Chi-square, the red indicator, the sum of lambda inverse values, Theil's indicator and CN.

> omcdiag(x, y, Inter = FALSE)
> omcdiag(x, y)[1]
> omcdiag(x, y, detr = 0.001, conf = 0.99)

The last command sets the threshold for the determinant and the confidence level for the Farrar and Glauber test.

## Individual collinearity diagnostics

The syntax of the imcdiag() function is

imcdiag(x, y, method = NULL, na.rm = TRUE, corr = FALSE, vif = 10, tol = 0.1, conf = 0.95, cvif = 10, leamer = 0.1, all = all)

The imcdiag() function detects the existence of multicollinearity due to a certain X-variable. This includes VIF, TOL, Klein's rule, CVIF, the F&G test of Chi-square and the F-test.

> imcdiag(x = x, y)
> imcdiag(x = x, y, corr = TRUE)               # correlation matrix
> imcdiag(x = x, y, vif = 5, leamer = 0.05)    # with threshold of VIF and leamer method
> imcdiag(x = x, y, all = TRUE)
> imcdiag(x = x, y, all = TRUE, vif = 5, leamer = 0.2, cvif = 5)

## Graphical representation of VIF and Eigenvalues

> mc.plot(x, y, Inter = FALSE, vif = 10, ev = 0.01)
> mc.plot(x, y)
> mc.plot(x, y, vif = 5, ev = 0.2)

For further detail about collinearity diagnostic see

## Statistical Models in R Language

The R language provides an interlocking suite of facilities that make fitting statistical models very simple. The output from statistical models in the R language is minimal, and one needs to ask for the details by calling extractor functions.

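
A brief sketch of what "minimal output plus extractor functions" looks like in practice. The model and the built-in mtcars data are arbitrary choices for illustration.

```r
fit <- lm(mpg ~ wt, data = mtcars)

fit                    # printing the fit shows only the call and coefficients

# Extractor functions return the details on request.
coef(fit)              # estimated regression coefficients
head(residuals(fit))   # residuals
head(fitted(fit))      # fitted values
anova(fit)             # analysis of variance table
summary(fit)           # standard errors, t-ratios, R-squared, and more
```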
### Defining Statistical Models; Formulae in R Language

The template for a statistical model is a linear regression model with independent, homoscedastic errors, that is

$y_i = \sum_{j=0}^p \beta_j x_{ij}+ e_i, \quad e_i \sim NID(0, \sigma^2), \quad i=1,2,\dots, n$

In matrix form, the statistical model can be written as

$y=X\beta+e$,

where $y$ is the dependent (response) variable, $X$ is the model matrix or design matrix (matrix of regressors) and has columns $x_0, x_1, \cdots, x_p$, the determining variables with the intercept term. Usually $x_0$ is a column of ones defining an intercept term in the statistical model.

### Statistical Model Examples

Suppose $y, x, x_0, x_1, x_2, \cdots$ are numeric variables and $X$ is a matrix. The following are some examples that specify statistical models in R.

• y ~ x or y ~ 1 + x
Both examples imply the same simple linear regression model of $y$ on $x$. The first formula has an implicit intercept term and the second formula has an explicit intercept term.
• y ~ 0 + x or y ~ -1 + x or y ~ x - 1
All these imply the same simple linear regression model of $y$ on $x$ through the origin, that is, without an intercept term.
• log(y) ~ x1 + x2
Implies multiple regression of the transformed variable $\log(y)$ on $x_1$ and $x_2$ with an implicit intercept term.
• y ~ poly(x, 2) or y ~ 1 + x + I(x^2)
Both imply a polynomial regression model of $y$ on $x$ of degree 2 (second-degree polynomial). The first formula uses orthogonal polynomials and the second formula uses explicit powers as the basis.
• y ~ X + poly(x, 2)
Multiple regression of $y$ with a model matrix consisting of the design matrix $X$ as well as polynomial terms in $x$ to degree 2.
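
The effect of these formulae on the model matrix can be inspected with model.matrix(). The sketch below uses simulated data (an assumption made only for the demonstration) and also checks that the two polynomial specifications give the same fitted values even though their bases differ.

```r
set.seed(1)                                   # simulated illustrative data
x <- seq(1, 10, length.out = 20)
y <- 2 + 0.5 * x + 0.1 * x^2 + rnorm(20, sd = 0.2)

# Implicit or explicit intercept: same model matrix columns.
colnames(model.matrix(y ~ x))                 # "(Intercept)" "x"
colnames(model.matrix(y ~ 1 + x))             # "(Intercept)" "x"

# Regression through the origin: no intercept column.
colnames(model.matrix(y ~ 0 + x))             # "x"

# Orthogonal polynomials versus explicit powers.
fit_orth <- lm(y ~ poly(x, 2))
fit_raw  <- lm(y ~ 1 + x + I(x^2))
max_diff <- max(abs(fitted(fit_orth) - fitted(fit_raw)))
max_diff                                      # numerically negligible
```

The coefficients of the two polynomial fits differ because the basis differs, but the fitted curve is the same.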

Note that the operator ~ is used to define a model formula in the R language. The form of an ordinary linear regression model is $response\,\, \sim \,\, op_1\,\, term_1\,\, op_2\,\, term_2\,\, op_3\,\, term_3\,\, \cdots$,

where

• response is a vector or matrix defining the response (dependent) variable(s).
• $op_i$ is an operator, either + or -, implying the inclusion or exclusion of a term in the model. The first operator, $op_1$, is optional when it is +.
• $term_i$ is either a matrix or vector or 1. It may be a factor or a formula expression consisting of factors, vectors or matrices connected by formula operators.
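
The inclusion and exclusion operators can be examined with terms(), which parses a formula without needing any data. The variable names in this sketch are placeholders.

```r
# "+" includes a term and "-" excludes it again.
f <- y ~ x1 + x2 + x3 - x2
attr(terms(f), "term.labels")            # "x1" "x3"

# The intercept is a term too: "- 1" removes it.
attr(terms(y ~ x1), "intercept")         # 1 (intercept present)
attr(terms(y ~ x1 - 1), "intercept")     # 0 (intercept removed)
```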