ParseSVM: A Simple C++ Library to read LIBSVM files

LIBSVM files are used a lot in machine learning area to test various classification algorithms. The main repository for downloading variaus LIBSVM (SVM from here on) files is at NTU.edu.tw website

Quite a few times, In real world cases, SVM datasets are very sparse, i.e. most of the feautures for observations are blank. We we can store the SVM data in a sparse matrix. I personally am used to coding my algorithms using compressed row storage (CRS) but I do know that others prefer comressed column storage (CCS). So this library can be used to get the SVM data in both the formats. To learn more about compressed column storage and compressed row storage, one can visit Compressed Row Storage and Compressed Column Storage. A sample SVM format file looks like this:

+1 1:10 5:-2
+1 1:3 2:9 6:3
+1 2:7 3:8 4:7
-1 1:3 3:8 4:7 5:5
-1 2:8 4:9 5:9 6:13
-1 2:4 5:2 6:-1

Each line in the file corresponds to an observation. and each observation starts with a label: either +1 or -1 (we only deal with binary classification in this post although ParseSVM can also parse multiple numberical labels). The subsequqent information describes the value of the feature and the feature position number. for instance, the first observation has a label of +1 and the value of first(1) feature is 10, fifth(5) feature is -2 and all the remaining feature values are 0.

I have used two different Matrix Classes to define the compressed CRS and CCS matrices:

#ifndef Matrix_h
#define Matrix_h

#include <vector>

struct CRS_Matrix{
    
    long long nzmax;  //maximum number of entries
    int m;  //Number of rows
    int n;  //number of columns
    
    std::vector<double> values;
    std::vector<int> col_index;
    std::vector<int> row_ptr;
    
    std::vector<int> y_label;

    CRS_Matrix() = default;
    
    CRS_Matrix(int m, int n, long long nzmax) {
        this->m = m;
        this->n = n;
        this->nzmax = nzmax;
        this->row_ptr = std::vector<int>((n + 1));
        this->col_index = std::vector<int>(nzmax);
        this->values = std::vector<double >(nzmax);
    }
};

struct CCS_Matrix{
    
    long long nzmax;  //maximum number of entries
    int m;  //Number of columns
    int n;  //number of rows
    
    std::vector<double> values;
    std::vector<int> row_index;
    std::vector<int> col_ptr;
    
    std::vector<int> y_label;
    
    CCS_Matrix() = default;
    
    CCS_Matrix(int m, int n, long long nzmax) {
        this->m = m;
        this->n = n;
        this->nzmax = nzmax;
        this->col_ptr = std::vector<int>((n + 1));
        this->row_index = std::vector<int>(nzmax);
        this->values = std::vector<double >(nzmax);
    }  
};

The main reason to define them like this is to avoid the confusion of ambiguasly named row_pointers and column_indices. The CCS matrix is generated by transposing the CRS matrix. Hence it will be confusing to keep the same names for row_pointers and column_indices. Despite this, there will be some confusuion. The m abd n values in CRS indicate number of rows and number of columns respectively while in CCS indicate vice versa. I followed the instructions in Tim Davis’s excellent book Direct methods for Sparse Linear Systems to transpose a sparse matrix. The complete code can be found at my GitHub repository

The code might not be very fast compared to an efficient implemenation in languages like C but I think it is decently fast. It can probably be optimized further and conatct me if you can make it faster.

On my modest 2.3 GHz Intel Core i5 and Low Power 8 GB RAM machine, it takes 20 ish seconds to read the covtype (581,012 observations and 54 features) libsvm dataset. The URL (2,396,130 observations and 3,231,961 features) dataset, a 2.1 GB file takes 10 minutes to load. The libabray can be used by supplying the path to the libsvm file.

References

Tim Davis, Direct Methods for Sparse Linear Systems, 2006, SIAM