Duplicate Question Detector

A Machine Learning Model to Detect Duplicate Questions

When: December 2022 Where: CMPT 732 (course-project)

"Hands down the most exciting part of my first term at SFU. This project was my first foray into applied machine learning, and it was equal parts challenging and rewarding."

We developed two machine learning models (based on StackOverflow data) to address the issue of duplicate questions on public Q&A forums:


  1. Using feature engineering and sentence encoders, we developed a model that can identify whether ANY two questions are similar.


  2. We developed a querying interface* that, given a user question, can fetch similar questions from the StackOverflow dataset.
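
The retrieval idea behind the querying interface can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the `embed` function here is a toy bag-of-words stand-in for a real sentence encoder, and the function names and the three-question corpus are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, corpus, k=3):
    """Return (score, question) pairs for the k most similar questions, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical mini-corpus, not taken from the project data.
corpus = [
    "How do I reverse a list in Python?",
    "What is the difference between a list and a tuple?",
    "How to center a div with CSS?",
]
results = top_k_similar("How can I reverse a Python list?", corpus, k=2)
```

Swapping `embed` for a learned sentence encoder would keep the rest of the sketch unchanged, since ranking only depends on a similarity score between two vectors.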

Some of the technologies used throughout the project

Flowchart of design process: data extraction and visualization to model training

What was my role?


  1. Data visualization and analysis: analyzing the preprocessed and processed StackOverflow datasets to understand the distribution of questions and any inherent biases in the data.


  2. Feature engineering: testing multiple features (word count, shared words, fuzzy ratio, etc.) to evaluate how well they distinguish similar from dissimilar questions.
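
As a rough illustration of the kind of pairwise features listed above, the sketch below computes a word-count difference, a shared-word count, and a fuzzy ratio for two questions. The feature names and the `pair_features` helper are hypothetical, and the fuzzy ratio uses the standard library's `difflib.SequenceMatcher` as a stand-in, since the fuzzy-matching library used in the project isn't named here.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pair_features(q1, q2):
    """Compute simple similarity features for a pair of questions."""
    t1, t2 = tokenize(q1), tokenize(q2)
    shared = set(t1) & set(t2)
    union = set(t1) | set(t2)
    # Stand-in fuzzy ratio: character-level similarity scaled to 0-100.
    fuzzy = SequenceMatcher(None, q1.lower(), q2.lower()).ratio()
    return {
        "word_count_diff": abs(len(t1) - len(t2)),
        "shared_words": len(shared),
        "shared_ratio": len(shared) / len(union) if union else 0.0,
        "fuzzy_ratio": round(100 * fuzzy),
    }


features = pair_features("How do I sort a list?", "How do I sort a dict?")
```

Each feature can then be plotted against the duplicate label to see whether its distribution actually differs between similar and dissimilar pairs.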


Is the difference in question length a good indicator of how similar (True) or dissimilar (False) two questions are? Short answer: No.

Analyzing the distribution of questions in the dataset. Most questions are unique (# occurrences = 1).
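
A distribution like this can be computed with a two-level count. The sketch below assumes that "occurrences" means the number of times the same question text appears in the dataset; the `occurrence_distribution` helper and the sample questions are hypothetical.

```python
from collections import Counter


def occurrence_distribution(questions):
    """Map each occurrence count to the number of distinct questions with that count."""
    # First level: how many times each (normalized) question text appears.
    per_question = Counter(q.strip().lower() for q in questions)
    # Second level: how many distinct questions share each occurrence count.
    return Counter(per_question.values())


# Hypothetical sample, not taken from the project data.
sample = [
    "How do I sort a list?",
    "How do I sort a dict?",
    "how do i sort a list?",
    "What is a lambda?",
]
distribution = occurrence_distribution(sample)
```

Here `distribution[1]` is the number of questions that appear exactly once, which corresponds to the "unique" bar in a plot of this kind.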