Duplicate Question Detector
A Machine Learning Model to Detect Duplicate Questions
When: December 2022
Where: CMPT 732 (course project)
"Hands down the most exciting part of my first term at SFU. This project was my first foray into applied machine learning and it was equal parts challenging and rewarding."
We developed two machine learning models (based on StackOverflow data) to address the issue of duplicate questions on public Q/A forums:
Using feature engineering and sentence encoders, we developed a model that can identify whether ANY two questions are similar.
We built a querying interface* that, given a user question, fetches similar questions from the StackOverflow dataset.
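As a rough illustration of how a similarity lookup like this can work, here is a minimal sketch. It is not the project's implementation: the `embed` function is a toy bag-of-words stand-in for a real sentence encoder, and the corpus, function names, and `k` parameter are all assumptions made for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query: str, corpus: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k corpus questions with the highest similarity to the query."""
    q_vec = embed(query)
    scored = [(cosine(q_vec, embed(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical mini-corpus, for illustration only.
corpus = [
    "how to reverse a list in python",
    "what is a segmentation fault in c",
    "reverse a string in python",
]
results = most_similar("how do i reverse a python list", corpus)
```

With a real sentence encoder, only `embed` would change; the ranking step stays the same.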
Some of the technologies used throughout the project
Flowchart of design process: data extraction and visualization to model training
What was my role?
Data visualization and analysis: analyzing the preprocessed and processed StackOverflow dataset to understand the distribution of questions and any inherent biases in the data.
Feature engineering: testing multiple features (word count, shared words, fuzzy ratio, etc.) to evaluate how well they distinguish similar from dissimilar questions.
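The kind of pairwise features mentioned above could be computed roughly as follows. This is a sketch under assumptions, not the project's code: the function name `pair_features` and the feature names are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whichever fuzzy-matching library was actually used.

```python
from difflib import SequenceMatcher


def pair_features(q1: str, q2: str) -> dict:
    """Compute a few illustrative similarity features for a question pair."""
    words1 = q1.lower().split()
    words2 = q2.lower().split()
    set1, set2 = set(words1), set(words2)
    union = set1 | set2

    return {
        # Absolute difference in word count between the two questions.
        "word_count_diff": abs(len(words1) - len(words2)),
        # Share of distinct words that appear in both questions (Jaccard).
        "shared_word_ratio": len(set1 & set2) / len(union) if union else 0.0,
        # Character-level similarity in [0, 1]; a stand-in for a fuzzy ratio.
        "fuzzy_ratio": SequenceMatcher(None, q1.lower(), q2.lower()).ratio(),
    }


features = pair_features(
    "How do I reverse a list in Python?",
    "How can I reverse a Python list?",
)
```

Each feature can then be plotted against the duplicate label to see whether it separates similar pairs from dissimilar ones.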
Is the difference in question length a good indicator of whether two questions are similar (True) or dissimilar (False)? Short answer: No.
Analyzing the distribution of questions in the dataset. Most questions are unique (# occurrences = 1).
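A distribution like this can be tallied with a two-level count. The sketch below uses a made-up sample rather than the actual dataset, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical sample of question titles; one question appears twice.
questions = [
    "how to reverse a list in python",
    "what is a segmentation fault",
    "how to reverse a list in python",
    "difference between let and var",
]

# Number of times each distinct question appears.
occurrences = Counter(questions)

# Histogram: occurrence count -> number of distinct questions with that count.
distribution = Counter(occurrences.values())

# Share of distinct questions that appear exactly once.
unique_share = distribution[1] / len(occurrences)
```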