
    Task    

Our task is to predict the similarity between questions posted on Quora in order to find duplicate questions. Similarity is computed for each pair of questions from features extracted from the question text, such as the number of common words and the ratio of common words to total words.


We feel this task is important because grouping duplicate questions together can save users a lot of time and surface answers they might otherwise miss if they never found every copy of the question. We feel this could also really benefit Piazza, where duplicate questions are posted often.

    Data    


The dataset we used is provided by Quora. Its attributes are:

  • id: Example row id

  • qid1: The unique ID of question1 in the pair

  • qid2: The unique ID of question2 in the pair

  • question1: Text of question 1 in the pair.

  • question2: Text of question 2 in the pair.

  • is_duplicate: Denotes whether the two questions are duplicates (this is the label we are trying to predict).

 

There are 404,301 pairs of questions in the data set, which means there are essentially 404,301 examples.


While analyzing the data, we plotted the following graph to check how many times a particular question is repeated and found that, apart from a few questions, most of them are unique, so the data is pretty solid.

[Figure: Histogram of question appearance counts]
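For reference, a count like this can be computed with pandas along the following lines; the file name and plotting details are illustrative, not the exact code we ran:

    # Illustrative sketch: count how often each question id occurs across all pairs.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("train.csv")  # hypothetical path to the Quora question-pairs CSV

    # Each question can show up as qid1 in some pairs and qid2 in others.
    appearance_counts = pd.concat([df["qid1"], df["qid2"]]).value_counts()

    plt.hist(appearance_counts.values, bins=50, log=True)
    plt.xlabel("Times a question appears in the data set")
    plt.ylabel("Number of questions (log scale)")
    plt.title("Histogram of question appearance counts")
    plt.show()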


   Approach   

We started off by extracting two simple features without altering the data: the number of common words in the question pair and the ratio of common words to the total number of words.
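A minimal sketch of these two features is shown below; the exact tokenization we used is not spelled out above, so treat this as illustrative:

    # Illustrative sketch of the two simple features for one question pair.
    def simple_features(q1: str, q2: str):
        words1 = set(q1.lower().split())
        words2 = set(q2.lower().split())
        common = len(words1 & words2)                   # number of common words
        total = len(words1) + len(words2)
        word_share = common / total if total else 0.0   # ratio of common to total words
        return common, word_share

    print(simple_features("How do I learn Python?",
                          "What is the best way to learn Python?"))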

 

Two features alone were not enough to build a good classifier, so we started processing the data and focused on extracting more features. We ended up using the fuzzywuzzy module to extract the following features (a short extraction sketch follows the list):

 

  • fuzz_ratio: Simple measure calculated using edit distance

  • fuzz_partial_ratio: Matches the shorter string against the best-matching substring of the longer one, to get around inconsistent substrings.

  • token_sort_ratio: Sorts tokens alphabetically and then compares the strings.

  • token_set_ratio: Tokens are split into two groups, intersection and remainder, which are used to build up comparison strings.

  • longest_substr_ratio: Ratio of the length of the longest common substring to the minimum of the token counts of Q1 and Q2.
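Below is a sketch of how these features can be extracted with fuzzywuzzy; the longest_substr_ratio computation is reconstructed from the description above and may differ slightly from what we actually used:

    # Illustrative sketch of the fuzzywuzzy-based features for one question pair.
    from difflib import SequenceMatcher
    from fuzzywuzzy import fuzz

    def fuzzy_features(q1: str, q2: str):
        # Length (in characters) of the longest common substring.
        match = SequenceMatcher(None, q1, q2).find_longest_match(0, len(q1), 0, len(q2))
        min_token_count = min(len(q1.split()), len(q2.split())) or 1
        return {
            "fuzz_ratio": fuzz.ratio(q1, q2),
            "fuzz_partial_ratio": fuzz.partial_ratio(q1, q2),
            "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
            "token_set_ratio": fuzz.token_set_ratio(q1, q2),
            "longest_substr_ratio": match.size / min_token_count,
        }

    print(fuzzy_features("How do I learn Python?",
                         "What is the best way to learn Python?"))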

 

We ran three classifiers on these features and achieved the following results:

[Figure: Results with 7 features, before adding tf-idf weighted word vectors]

Trying to improve our results, we moved on to converting the words into vectors. We used tf-idf weighted word vectors to represent the words in the questions, which gave us 384 features per question in the pair, i.e. 768 new features. We then ran the above classifiers, along with XGBoost, on the expanded feature set (a sketch of the vector construction follows), achieving the results shown below:
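One way to build such tf-idf weighted question vectors with spaCy and scikit-learn is sketched here; the model name and the exact weighting scheme are assumptions for illustration rather than our precise setup:

    # Illustrative sketch: represent each question as a tf-idf weighted average
    # of its spaCy word vectors. Model choice and weighting details are assumptions.
    import numpy as np
    import spacy
    from sklearn.feature_extraction.text import TfidfVectorizer

    nlp = spacy.load("en_core_web_md")  # hypothetical model; any model with word vectors works

    questions = [
        "How do I learn Python?",
        "What is the best way to learn Python?",
    ]

    # Fit tf-idf on all question text to get per-word idf weights.
    tfidf = TfidfVectorizer()
    tfidf.fit(questions)
    idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

    def question_vector(text: str) -> np.ndarray:
        doc = nlp(text)
        vectors, weights = [], []
        for token in doc:
            if token.has_vector:
                vectors.append(token.vector)
                weights.append(idf.get(token.lower_, 1.0))
        if not vectors:
            return np.zeros(nlp.vocab.vectors_length)
        return np.average(vectors, axis=0, weights=weights)

    question_vectors = np.vstack([question_vector(q) for q in questions])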

[Figure: Results after adding tf-idf weighted word vectors]


    Result    

We found that using XGBoost on all of our features (the tf-idf weighted word vectors, the fuzzywuzzy features, and the two simple features, giving a total of 775 features) produced the best results. The confusion, precision, and recall matrices for these results are displayed below:

[Figure: Confusion, precision, and recall matrices for XGBoost (best classifier) on the data with tf-idf weighted word vectors added]
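For reference, a minimal sketch of this final training and evaluation step is shown below; the feature matrix here is a random stand-in and the hyperparameters are illustrative, not our exact configuration:

    # Illustrative sketch: train XGBoost on the combined 775-column feature matrix
    # and compute the confusion, precision, and recall matrices shown above.
    import numpy as np
    from sklearn.metrics import confusion_matrix, f1_score, log_loss
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 775))    # stand-in for the real 775 features per pair
    y = rng.integers(0, 2, size=1000)   # stand-in for the is_duplicate labels

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    clf = XGBClassifier(n_estimators=400, max_depth=6, learning_rate=0.1)  # illustrative settings
    clf.fit(X_train, y_train)

    pred = clf.predict(X_test)
    proba = clf.predict_proba(X_test)

    cm = confusion_matrix(y_test, pred)
    precision_matrix = cm / cm.sum(axis=0, keepdims=True)  # column-normalized: precision per class
    recall_matrix = cm / cm.sum(axis=1, keepdims=True)     # row-normalized: recall per class

    print("log loss:", log_loss(y_test, proba))
    print("F1 score:", f1_score(y_test, pred))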

As shown in the tables above, XGBoost also outperformed the other classifiers in terms of F1 score and log loss. The one surprising finding we came across was that Random Forest outperformed XGBoost when we were not using tf-idf weighted word vectors; we go into detail about why this might be in our report. Overall, though, using tf-idf word vectors along with XGBoost proved to be the best solution so far.

    Future    


The spaCy models we used in this project are trained on the language used on Wikipedia, which we feel does not cover the colloquial language found on sites like Quora. To deal with this, we theorize that a character-level model might work better, so that should be our next step. We also think that better data pre-processing techniques, such as grouping synonyms together, should help us get better results. Finally, the lack of powerful hardware kept us from using the entire data set, which hindered our results, so running on the full data set with more powerful machines will be at the top of our priority list.
