Error: cannot allocate vector of size 26.6 Gb


This is a common error when you apply an ML algorithm in R without proper data preparation. The size reported in the message will vary with your data set.

Reason:

A frequent cause of this error is a numeric variable that contains characters in some of its records. A column in an R data frame supports only one data type. If a column holds records of several data types, R coerces them all to the most general type needed to represent every value (logical, then numeric, then character). The example below illustrates this:

# Consider the data frame below
df <- data.frame("Col1" = c(1, 2, 3, 4), "Col2" = c("a", "b", "c", "d"),
                 "Col3" = c(1, 2, "z", 4), "Col4" = c(1, 2, TRUE, FALSE),
                 "Col5" = c(1, 0, TRUE, FALSE), "Col6" = c("A", TRUE, 1, "1_x"),
                 "Col7" = c(TRUE, TRUE, FALSE, FALSE), stringsAsFactors = FALSE)
df

##   Col1 Col2 Col3 Col4 Col5 Col6  Col7
## 1    1    a    1    1    1    A  TRUE
## 2    2    b    2    2    0 TRUE  TRUE
## 3    3    c    z    1    1    1 FALSE
## 4    4    d    4    0    0  1_x FALSE

To check the data type of each column, I use the str() function:

str(df)

## 'data.frame':    4 obs. of  7 variables:
##  $ Col1: num  1 2 3 4
##  $ Col2: chr  "a" "b" "c" "d"
##  $ Col3: chr  "1" "2" "z" "4"
##  $ Col4: num  1 2 1 0
##  $ Col5: num  1 0 1 0
##  $ Col6: chr  "A" "TRUE" "1" "1_x"
##  $ Col7: logi  TRUE TRUE FALSE FALSE

With stringsAsFactors set to FALSE, Col3, which mixes numbers and characters, is coerced to "chr". Col6, which mixes all three types, is also coerced to "chr". Col4 and Col5 mix logical and numeric values; because logical TRUE == 1 and FALSE == 0, they are converted to "num". However, if stringsAsFactors is TRUE, which was the default before R 4.0.0 (from 4.0.0 onward the default is FALSE), all character columns are converted to factors.

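# stringsAsFactors = TRUE was the default before R 4.0.0; it is set explicitly
# here so the result is the same on every R version
df <- data.frame("Col1" = c(1, 2, 3, 4), "Col2" = c("a", "b", "c", "d"),
                 "Col3" = c(1, 2, "z", 4), "Col4" = c(1, 2, TRUE, FALSE),
                 "Col5" = c(1, 0, TRUE, FALSE), "Col6" = c("A", TRUE, 1, "1_x"),
                 "Col7" = c(TRUE, TRUE, FALSE, FALSE), stringsAsFactors = TRUE)
str(df)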

## 'data.frame':    4 obs. of  7 variables:
##  $ Col1: num  1 2 3 4
##  $ Col2: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
##  $ Col3: Factor w/ 4 levels "1","2","4","z": 1 2 4 3
##  $ Col4: num  1 2 1 0
##  $ Col5: num  1 0 1 0
##  $ Col6: Factor w/ 4 levels "1","1_x","A",..: 3 4 1 2
##  $ Col7: logi  TRUE TRUE FALSE FALSE

Now, when the data frame has a large number of observations, a single record containing a character is enough to turn the whole column into a factor. When an algorithm is then applied to the dataset, it needs a huge amount of memory, because it treats every distinct value of the continuous numeric variable as a separate factor level.
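To give a rough feel for the memory cost, here is a minimal sketch. It assumes the algorithm builds a design matrix the way model.matrix() does, with one dummy column per factor level; other algorithms may encode factors differently. The names n, x_num and x_fac are made up for this illustration.

```r
set.seed(42)
n <- 2000

x_num <- runif(n)                      # a continuous numeric predictor
x_fac <- factor(as.character(x_num))   # the same values, misread as a factor

mm_num <- model.matrix(~ x_num)        # intercept + 1 numeric column
mm_fac <- model.matrix(~ x_fac)        # intercept + one dummy column per extra level

dim(mm_num)
dim(mm_fac)

# Compare the memory used by the two design matrices
print(object.size(mm_num), units = "Kb")
print(object.size(mm_fac), units = "Mb")
```

The factor version has about as many columns as there are distinct values, so its size grows roughly with the square of the number of rows. That is how an ordinary-looking data set can end up asking for tens of gigabytes.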

Solution:

Before applying any algorithm, check the data types of the training set with the str() function to see whether any numeric column is showing up as Factor (or chr). If one is, fix it either by removing the offending observations or by correcting them.
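The check and the fix could look something like the sketch below. It reuses Col1 and Col3 from the example above; the helper names suspect, converted, bad and df_clean are assumptions made for this illustration, and you may prefer to correct the bad values by hand instead of dropping them.

```r
df <- data.frame(Col1 = c(1, 2, 3, 4),
                 Col3 = c(1, 2, "z", 4),
                 stringsAsFactors = TRUE)

# Columns stored as factor or character are candidates for inspection
suspect <- names(df)[sapply(df, function(col) is.factor(col) || is.character(col))]
suspect

# Convert through as.character() first; calling as.numeric() directly on a
# factor would return the internal level codes, not the original values
converted <- suppressWarnings(as.numeric(as.character(df$Col3)))

# Records that were not missing before but could not be parsed as numbers
bad <- is.na(converted) & !is.na(df$Col3)
df[bad, ]

# Option 1: remove the offending observations
df_clean <- df[!bad, , drop = FALSE]
df_clean$Col3 <- converted[!bad]
str(df_clean)

# Option 2: keep every row and mark the unparseable values as NA,
# to be corrected or imputed later
df$Col3 <- converted
str(df)
```

Logical indexing with !bad is used instead of negative row numbers so that the code still behaves correctly when no bad records are found.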

Hope this is helpful. Thank you.
