A DL solution for Tab-Text Data

Can it Beat XGBoost?

Natan Katz
Towards Data Science


Motivation

DL models have achieved great results in various domains such as vision and NLP. However, when it comes to tabular data they have shown less success, and in most applications they perform worse than XGBoost.

There are plenty of reasons that may explain this phenomenon, such as the fact that there is no obvious order metric for tabular data (when we handle Boolean or categorical variables, the only metric we can assume a priori is the discrete one). However, there is less work on combined data: data that consists of both tabular columns and text. In this post I will present a DL solution for such a dataset. Our target variable will be a multi-categorical one.

To present the engine, I will create a mock dataset using scikit-learn's datasets services.

import pickle
from sklearn.datasets import fetch_covtype, fetch_20newsgroups


def get_mock_db(my_pickle_name):
    # Tabular part: the forest cover-type dataset
    covert = fetch_covtype(data_home=None, download_if_missing=True,
                           random_state=None, shuffle=False)
    # Text part: four categories of the 20 newsgroups dataset
    categories = ['alt.atheism', 'soc.religion.christian',
                  'comp.graphics', 'sci.med']
    twenty_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories,
                                      shuffle=True, random_state=42)
    clean_text = [(doc.replace('\n', '').replace('\t', '').replace('\\', ''), lab)
                  for doc, lab in zip(twenty_train.data, twenty_train.target)]
    clean_text = [item for item in clean_text if len(item[0]) > 20]

    n_rows, _ = covert.data.shape
    # For every text label we attach cover-type rows of a fixed target value
    cover_targets = [4, 1, 3, 3]
    for text_label, cover_target in enumerate(cover_targets):
        class_len = len([item for item in clean_text if item[-1] == text_label])
        cov_rows = [covert.data[j] for j in range(n_rows)
                    if covert['target'][j] == cover_target][:class_len]
        cntr = 0
        for j in range(len(clean_text)):
            if len(clean_text[j]) == 2 and clean_text[j][1] == text_label:
                doc, lab = clean_text[j]
                clean_text[j] = (doc, cov_rows[cntr], lab)
                cntr += 1
                if cntr == class_len:
                    break

    with open(my_pickle_name, 'wb') as f:
        pickle.dump(clean_text, f, pickle.HIGHEST_PROTOCOL)
    print("files_generated")
    return

We can transform the tabular data in many ways, such as taking integer columns as they are, applying a deterministic aggregator, or converting Boolean variables to a one-hot encoding. However, in this post I will try a different approach: I will use an embedding procedure for the tabular data. For this purpose we will use TabNet. This package offers several embedding methods.

The following code performs such an embedding:

from pytorch_tabnet.pretraining import TabNetPretrainer

clf = TabNetPretrainer()
clf.fit(X_train[:n_steps])
# Here we embed the data
embed_tabular = clf.predict(X_test)[1]
# Attach the embedded tabular rows to the corresponding text and label
raw_data = [(i[0], j, i[2]) for i, j in zip(data, embed_tabular)]

Now the data is prepared and saved to a pickle file. We can move forward to preprocessing the text using Hugging Face's tokenizers.

Tokenization

In this part we take the text and convert it into a format that allows the Hugging Face engine to perform its embedding. I used two types of tokenizers: Bert and Fnet. The former is well known due to the rise of transformers; the latter is a quite new approach that suggests replacing the multi-head attention layers with FFT. One can learn about it here.

from transformers import AutoTokenizer, FNetTokenizer
import enum_collection as enum_obj


def get_tokenizer(data_yaml, bert_tokenizer_name):
    # Choose the tokenizer according to the embedding flag from the Yaml file
    if data_yaml['embed_val'] == enum_obj.huggins_embedding.fnet.value:
        return FNetTokenizer.from_pretrained('google/fnet-base')
    tokenizer = AutoTokenizer.from_pretrained(bert_tokenizer_name, use_fast=True)
    return tokenizer


# The tokenization step itself (written as pseudo code):
batch_encoding = tokenizer.batch_encode_plus([text for text in data],
                                             **tokenizer_params,
                                             return_tensors='pt')

Based on an external Yaml flag, we choose our tokenizer. The last line is pseudo code and presents the tokenization step itself.
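For illustration, a minimal usage of the function above might look as follows. This is only a sketch: the tokenizer_params values (max_length, padding, truncation) are my assumptions and not necessarily the configuration used in the original repository.

# Hypothetical usage sketch; tokenizer_params values are assumptions
tokenizer = get_tokenizer(data_yaml, 'bert-base-multilingual-cased')
tokenizer_params = {'max_length': 256, 'padding': 'max_length', 'truncation': True}
texts = [item[0] for item in data_lab_and_docs]   # the raw documents
batch_encoding = tokenizer.batch_encode_plus(texts, **tokenizer_params,
                                             return_tensors='pt')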

Creating Tensors Folder

In order to train and evaluate, we need to turn the data (the combined tabular and text data) into a folder of tensor files that will be loaded into the neural network using PyTorch's DataLoader. We present several code pieces that generate this folder. We begin with these two functions:


import torch


def bert_gen_tensor(input_t, tab_tensor, lab_all_Place,
                    file_name, batch_enc, index_0):
    # Bert also needs the attention mask of the current sample
    input_m = torch.squeeze(torch.index_select(batch_enc["attention_mask"],
                                               dim=0,
                                               index=torch.tensor(index_0)))
    torch.save([input_t, input_m, tab_tensor, lab_all_Place], file_name)
    return


def fnet_gen_tensor(input_t, tab_tensor, lab_all_Place,
                    file_name, batch_enc=None, index_0=-1):
    # Fnet has no attention mask, so we save one tensor less
    torch.save([input_t, tab_tensor, lab_all_Place], file_name)
    return

One can see that they are nearly identical. They reflect the difference between the tensors required for Bert and Fnet: Bert requires an "attention_mask" key that does not exist for Fnet. Thus, on the one hand we need to save a different set of tensors for each embedding method, while on the other hand we wish to keep a single code path. We solve this problem as follows:

def generate_data_folder_w_tensor(data_lab_and_docs, data_yaml):
    # ...
    if data_yaml['embed_val'] == enum_obj.huggins_embedding.fnet.value:
        proc_func = fnet_gen_tensor
    else:
        proc_func = bert_gen_tensor

In addition to the key issue, each file contains the input data, the tabular data, and the label; the common signature of the two functions above handles both cases.

We are now ready to iterate over the data items and create the tensors folder:

for i, data in enumerate(data_lab_and_docs):
    file_name = pref_val + "_" + str(i) + "_" + suff_val
    tab_tensor = torch.tensor(data[1], dtype=torch.float32)
    input_t = torch.squeeze(torch.index_select(batch_encoding["input_ids"],
                                               dim=0, index=torch.tensor(i)))
    proc_func(input_t, tab_tensor, data[2], file_name,
              batch_enc=batch_encoding, index_0=i)
    dic_for_pic[file_name] = data[2]

with open(data_yaml['labales_file'], 'wb') as f:
    pickle.dump(dic_for_pic, f, pickle.HIGHEST_PROTOCOL)

This loop is pretty trivial; we set the file names and use the functions above to save the information. We also create a dictionary that maps the file names to their label values for future use.

Once the process terminates, we have a tensors folder to be used in the training and evaluation steps. We can nearly start training. Nearly? Yes: since we are working with PyTorch, we need to create our data loader.

import torch
import enum_collection as enum_obj
import random


class dataset_tensor:
    def __init__(self, list_of_files, embed_val):
        self.list_of_files = list_of_files
        random.shuffle(list_of_files)
        if embed_val == enum_obj.huggins_embedding.fnet.value:
            # Fnet files hold [input, tabular, label]; the input is returned twice
            self.ref = [0, 1, 2]
        else:
            # Bert files hold [input, attention_mask, tabular, label]
            self.ref = [1, 2, 3]

    def __getitem__(self, idx):
        aa = torch.load(self.list_of_files[idx])
        return aa[0], aa[self.ref[0]], aa[self.ref[1]], aa[self.ref[2]]

    def __len__(self):
        return len(self.list_of_files)

As one may notice, this is a standard structure for a PyTorch dataset. One thing requires clarification: the self.ref array. Since Fnet and Bert store different tensor collections in a single file and we wish to use the same training procedure, for Fnet we output a dummy variable (the input tensor twice). self.ref determines which tensor indices within the file are returned.
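As a small illustration (not from the original repository), this is what one batch looks like once the dataset is wrapped in a DataLoader; for Fnet the second item is simply the input tensor again:

from torch.utils.data import DataLoader

# Illustration only: unpacking one batch
loader = DataLoader(dataset_tensor(list_of_files, data_yaml['embed_val']),
                    batch_size=4, shuffle=True)
input_ids, mask_or_dummy, tab, label = next(iter(loader))
# Bert files yield: (input_ids, attention_mask, tabular, label)
# Fnet files yield: (input_ids, input_ids,      tabular, label)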

Model

This is probably the most interesting part in the code. The model performs two steps:

  • Embeds the tokenized data following a given recipe (Fnet or Bert)
  • Processes the embedding tensor and maps it into a tensor whose size is the number of categories

The first step depends on the embedding type that we choose. Hence we require a special "forward" for each of them:

import torch.nn as nn
from transformers import (BertModel, BertForSequenceClassification,
                          FNetModel, FNetForSequenceClassification)


class my_fnet(FNetForSequenceClassification):

    def __init__(self, config, dim=768):
        super(my_fnet, self).__init__(config)
        self.dim = dim
        self.num_labels = 4
        self.distilbert = FNetModel(config)
        self.init_weights()
        self.pre_classifier = nn.Linear(self.dim, self.dim)

    def forward(self, input_ids=None):
        outputs = self.distilbert(input_ids=input_ids)
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        return pooled_output


class my_bert(BertForSequenceClassification):

    def __init__(self, config, dim=768):
        super(my_bert, self).__init__(config)
        self.dim = dim
        self.num_labels = 4
        self.distilbert = BertModel(config)
        self.init_weights()
        self.pre_classifier = nn.Linear(self.dim, self.dim)

    def forward(self,
                input_ids=None,
                attention_mask=None,
                head_mask=None,
                inputs_embeds=None,
                output_attentions=None,
                output_hidden_states=None,
                return_dict=None):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        distilbert_output = self.distilbert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_state = distilbert_output[0]
        # Use the [CLS] token's hidden state as the pooled embedding
        pooled_output = hidden_state[:, 0]
        return pooled_output

The model itself is presented below:

class my_model(nn.Module):

    def __init__(self, data_yaml, my_tab_form=1, dim=768):
        super(my_model, self).__init__()
        self.forward = self.bert_forward
        if data_yaml['embed_val'] == enum_obj.huggins_embedding.distil_bert.value:
            # my_dist is the analogous DistilBert wrapper (not shown in this post)
            self.dist_model = my_dist.from_pretrained(
                'distilbert-base-multilingual-cased', num_labels=4)
        elif data_yaml['embed_val'] == enum_obj.huggins_embedding.base_bert.value:
            self.dist_model = my_bert.from_pretrained(
                'bert-base-multilingual-cased', num_labels=4)
        else:
            self.dist_model = my_fnet.from_pretrained('google/fnet-base',
                                                      num_labels=4)
            self.forward = self.fnet_forward
        if my_tab_form > 0:
            localtab = data_yaml['tab_format']
        else:
            localtab = my_tab_form
        if localtab == enum_obj.tab_label.no_tab.value:
            print("no_tab")
            self.embed_to_fc = self.cat_no_tab
            self.tab_dim = 0
        else:
            self.embed_to_fc = self.cat_w_tab
            self.tab_dim = data_yaml['tab_dim']

        self.dim = dim
        self.num_labels = 4

        self.pre_classifier = nn.Linear(self.dim, self.dim)
        self.inter_m0 = nn.Linear(self.dim + self.tab_dim, 216)
        self.inter_m1 = nn.Linear(216, 64)
        self.inter_m1a = nn.Linear(64, 32)
        self.inter_m3 = nn.Linear(32, self.num_labels)
        self.classifier = nn.Linear(self.dim, self.num_labels)
        self.dropout = nn.Dropout(0.2)

    def cat_no_tab(self, hidden, x):
        # Void concatenation: ignore the tabular input
        return hidden

    def cat_w_tab(self, hidden, x):
        # Concatenate the text embedding with the tabular embedding
        return torch.cat((hidden, x), dim=1)

    def fnet_forward(self, x, input_ids=None, attention_mask=None):
        hidden_state = self.dist_model(input_ids)
        pooled_output = torch.cat((hidden_state, x), dim=1)

        pooled_output = self.inter_m0(pooled_output)   # (bs, 216)
        pooled_output = nn.ReLU()(pooled_output)
        pooled_output = self.dropout(pooled_output)

        pooled_output = self.inter_m1(pooled_output)   # (bs, 64)
        pooled_output = nn.ReLU()(pooled_output)
        pooled_output = self.dropout(pooled_output)

        pooled_output = self.inter_m1a(pooled_output)  # (bs, 32)
        pooled_output = nn.ReLU()(pooled_output)
        pooled_output = self.dropout(pooled_output)
        logits = self.inter_m3(pooled_output)          # (bs, num_labels)
        return logits

    def bert_forward(self, x, input_ids=None, attention_mask=None):
        hidden_state = self.dist_model(input_ids, attention_mask)
        pooled_output = self.embed_to_fc(hidden_state, x)

        pooled_output = self.inter_m0(pooled_output)   # (bs, 216)
        pooled_output = nn.ReLU()(pooled_output)
        pooled_output = self.dropout(pooled_output)
        pooled_output = self.inter_m1(pooled_output)   # (bs, 64)
        pooled_output = nn.ReLU()(pooled_output)
        pooled_output = self.dropout(pooled_output)
        pooled_output = self.inter_m1a(pooled_output)  # (bs, 32)
        pooled_output = nn.ReLU()(pooled_output)
        pooled_output = self.dropout(pooled_output)
        logits = self.inter_m3(pooled_output)          # (bs, num_labels)
        return logits

In addition to the different forward functions already discussed, we have the my_tab_form flag. This flag indicates whether we use solely the text or the combined data (we commonly use the latter). Code-wise, we replace torch.cat with a void function that simply returns the input tensor:

if my_tab_form > 0:
    localtab = data_yaml['tab_format']
else:
    localtab = my_tab_form

Loss Function

Our loss is usually a standard cross entropy. Nevertheless, since for some applications we need to reduce specific failures associated with one of the categories, I added several specific functions that "uniquely punish" these errors. One can read more here.
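The exact implementation lives in the repository. As a rough, hypothetical sketch only, a loss_manager with the interface used below (constructed from batch_size, target_val and reg_term, and exposing a crit method) could up-weight errors on the sensitive category like this:

import torch
import torch.nn as nn


class loss_manager:
    """Hypothetical sketch, not the repository's implementation.

    Adds an extra penalty for samples whose true label is the sensitive
    category (target_val), scaled by reg_term.
    """

    def __init__(self, batch_size, target_val, reg_term):
        self.batch_size = batch_size  # kept only to match the interface
        self.target_val = target_val
        self.reg_term = reg_term
        self.ce = nn.CrossEntropyLoss(reduction='none')

    def crit(self, logits, labels):
        per_sample = self.ce(logits, labels)
        # Uniquely punish errors on the sensitive category
        weights = torch.ones_like(per_sample)
        weights[labels == self.target_val] = 1.0 + self.reg_term
        return (weights * per_sample).mean()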

The Main Loop

Finally, we can use all the blocks above to perform the training step. We begin by loading the data:

import os
import yaml
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

with open("project_yaml.yaml", 'r') as stream:
    data_yaml = yaml.safe_load(stream)
list_of_files = os.listdir(data_yaml['tensors_folder'])
list_of_files = [data_yaml['tensors_folder'] + i for i in list_of_files]
X_train, X_test = train_test_split(list_of_files, test_size=0.2)

# Creating the dataloaders
train_t = dataset_tensor(X_train, data_yaml['embed_val'])
train_loader = DataLoader(train_t, batch_size=data_yaml['batch_size'], shuffle=True)

test_t = dataset_tensor(X_test, data_yaml['embed_val'])
test_loader = DataLoader(test_t, batch_size=data_yaml['batch_size'], shuffle=True)

We load the Yaml file and split the data into train and test sets (a 0.2 test size is just an example; there is no deep theory behind it). For readers less familiar with torch, the last lines create the DataLoaders for both the train and the test files.

device = ""
if torch.cuda.is_available():
    device = torch.device("cuda:0")

model = my_model(data_yaml)

# Pre-training usage
if data_yaml['improv_model']:
    print("Loading model")
    model_place = data_yaml['pre_trained_folder']
    print(model_place)
    model.load_state_dict(torch.load(model_place, map_location='cpu'))

if device:
    model = model.to(device)

We set the device for the lucky readers who have a GPU. Afterwards we define the model structure and load its weights in case we wish to use a pre-trained model.

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, eps=1e-8)

loss_man = loss_manager(data_yaml['batch_size'],
                        data_yaml['target_val'], data_yaml['reg_term'])
model.train()

We define the optimizer (one ought to decrease the learning rate for most transformer tasks) and the loss function. Now we can perform the training iterations:

for i_epoc in range(data_yaml['n_epochs']):
    running_loss = 0.0
    counter_a = 0
    for batch_idx, data in enumerate(train_loader):

        a, b, d, c = data
        if device:
            a = a.to(device)
            b = b.to(device)
            c = c.to(device)
            d = d.to(device)

        ss = model(x=d, input_ids=a, attention_mask=b)

        loss = loss_man.crit(ss, c)
        running_loss += loss.item()
        counter_a += 1

        loss.backward()
        print(loss, batch_idx)
        optimizer.step()
        optimizer.zero_grad()

    print("Epoch loss= ", running_loss / (counter_a + 0.))
    print("End of epoch")
    torch.save(model.state_dict(),
               data_yaml['models_folder'] + "model_epoch_" + str(i_epoc) + ".bin")

We can add an eval loop at the end of the training using tqdm:

from tqdm import tqdm

# ...
model.eval()  # switch to eval mode (disables dropout)
with torch.no_grad():
    with tqdm(total=len(test_loader), ncols=70) as pbar:
        labels = []
        predic_y = []
        for batch_idx, data in enumerate(test_loader):

            a, b, d, c = data
            if device:
                a = a.to(device)
                b = b.to(device)
                c = c.to(device)
                d = d.to(device)
            labels.append(c)
            outp = model(x=d, input_ids=a, attention_mask=b)
            probs = nn.functional.softmax(outp, dim=1)
            predic_y.append(probs)
            pbar.update(1)

y_true, y_pred = convert_eval_score_and_label_to_np(labels, predic_y)

The results and their analysis are left to interested readers.
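The helper convert_eval_score_and_label_to_np belongs to the repository; a plausible (purely hypothetical) version, together with a quick accuracy check, could look like this:

import torch
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix


def convert_eval_score_and_label_to_np(labels, predic_y):
    # Hypothetical sketch: stack the per-batch tensors and move them to numpy
    y_true = torch.cat(labels).cpu().numpy()
    y_pred = torch.cat(predic_y).argmax(dim=1).cpu().numpy()
    return y_true, y_pred


# e.g. a quick sanity check on the eval outputs
print("accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))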

Summary

We presented a combined-data model and a DL engine for training it. Architecturally speaking, we presented a fairly simple neural network. However, on some of the tasks on which it was trained and tested, it has shown better results than comparable XGBoost models. This gives hope that DL engines can handle tabular data well. Beyond the purely DL aspect, I believe that combined-data tasks are endowed with nice mathematical riddles, since they involve not only different data types but also the metrics and topologies those types induce. The interface between orderable and non-orderable variables, as well as between complete and non-complete topologies, may provide a wide domain of research. Another question one may ask while reading this post is whether we can improve the FFT that is used in Fnet (e.g., using wavelets).

Acknowledgment

I wish to thank Uri Itai for providing fruitful ideas during the writing of this post.

Code is located here.
