This thesis investigates the usefulness of various aspects of product data for predicting user behavior in the online shopping market. Specifically, a data set from BestBuy was used, containing information about which product a user clicked on given their search query.
Decision trees are machine learning algorithms used for making predictions. The decision tree algorithm ID3 was used because of its simplicity and interpretability. It uses information gain to measure how well each attribute splits the data set into smaller subsets. The approach was to build one decision tree for each product in the data set and to analyze the distribution of the attributes' maximum information gains in the root splits across the various trees. For each of these splits, all possible pivot values (a pivot value being the value split on) were tried, and the pivot values were also recorded in order to analyze which of them resulted in the most gain.
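To make the information gain and pivot search concrete, the following is a minimal sketch and not the implementation used in the thesis. It assumes a numeric attribute, a binary split of the form "value <= pivot" versus "value > pivot", and binary labels (1 if the product was clicked, 0 otherwise). The function names and the example attribute values are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(values, labels, pivot):
    """Gain from splitting into 'value <= pivot' and 'value > pivot'.

    The binary threshold split is an assumption of this sketch.
    """
    left = [y for x, y in zip(values, labels) if x <= pivot]
    right = [y for x, y in zip(values, labels) if x > pivot]
    total = len(labels)
    remainder = (len(left) / total) * entropy(left) \
        + (len(right) / total) * entropy(right)
    return entropy(labels) - remainder


def best_pivot(values, labels):
    """Try every distinct attribute value as a pivot; return (pivot, gain)."""
    candidates = sorted(set(values))
    return max(
        ((p, information_gain(values, labels, p)) for p in candidates),
        key=lambda pair: pair[1],
    )


# Hypothetical attribute values (e.g. a title match score) and click labels.
scores = [0.1, 0.4, 0.6, 0.9]
clicked = [0, 0, 1, 1]
pivot, gain = best_pivot(scores, clicked)
```

In this toy example the pivot 0.4 separates the two classes perfectly, so the gain equals the full entropy of the labels (1 bit). Recording the `(pivot, gain)` pair for the root split of each product's tree would give the kind of distributions the analysis relies on.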
The results show that how well the query string matches the product title and how well it matches the product description are the two most important aspects, followed by the product's novelty. The number of days between the query and each of the last two reviews written before it proved a decent way to identify trends.
The thesis also presents how the attributes were used by analyzing the pivot value distributions, concluding that many attributes were used in similar ways for most products, which suggests it might be possible to create a universal tree applicable to all products.
Regarding the usefulness of decision trees, it was found that they are not very efficient for highly volatile databases, such as those found in the online shopping market. The notion of a universal tree, however, suggests that future work might investigate whether their efficiency could be improved using this more flexible approach.