Duplicate Question Detector

A Machine Learning Model to Detect Duplicate Questions

When: December 2022

Where: CMPT 732 (course project)

Hands down the most exciting part of my first term at SFU. This project was my first foray into applied machine learning, and it was equal parts challenging and rewarding.

We developed two machine learning models (based on StackOverflow data) to address the issue of duplicate questions on public Q/A forums:

Using feature engineering and sentence encoders, we developed a model that can identify whether ANY two questions are similar.

We developed a querying interface* which, given a user question, could fetch similar questions from the StackOverflow dataset.
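To illustrate the idea behind such a querying interface, here is a minimal sketch of similarity search. It is not the project's implementation: a simple bag-of-words vector stands in for the sentence encoder, and the names `embed`, `cosine`, and `top_k` are hypothetical helpers made up for this example.

```python
import math
from collections import Counter


def embed(text):
    # Stand-in for a sentence encoder (assumption): a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, corpus, k=3):
    # Score every stored question against the query and keep the k best.
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


corpus = [
    "how to sort a list in python",
    "what is a monad",
    "sort a python list",
]
best_score, best_question = top_k("sort list python", corpus, k=1)[0]
```

With a real sentence encoder, only `embed` would change; the ranking step stays the same.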

Some of the technologies used throughout the project

Flowchart of design process: data extraction and visualization to model training

What was my role?

Data visualization and analysis: performing analysis on preprocessed and processed StackOverflow dataset to understand the distribution of questions and any inherent biases in the data.

Feature engineering: testing multiple features (word count, shared words, fuzzy ratio, etc.) to evaluate their efficacy in distinguishing similar and dissimilar questions.
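As a rough sketch of what pairwise features like these can look like, the snippet below computes a few of them for two question strings. The function name `pair_features` and the feature keys are hypothetical, and the fuzzy ratio here uses the standard library's `difflib.SequenceMatcher` as a stand-in, which may differ from the fuzzy-matching library used in the project.

```python
from difflib import SequenceMatcher


def pair_features(q1, q2):
    # Lowercase and split on whitespace; a real pipeline would likely
    # also strip punctuation and stop words (assumption).
    words1, words2 = q1.lower().split(), q2.lower().split()
    set1, set2 = set(words1), set(words2)
    shared = len(set1 & set2)
    union = len(set1 | set2)
    return {
        # Absolute difference in question length, measured in words.
        "word_count_diff": abs(len(words1) - len(words2)),
        # Number of distinct words the two questions have in common.
        "shared_words": shared,
        # Shared words normalized by the combined vocabulary (Jaccard).
        "shared_ratio": shared / union if union else 0.0,
        # Character-level similarity in [0, 1].
        "fuzzy_ratio": SequenceMatcher(None, q1.lower(), q2.lower()).ratio(),
    }


features = pair_features(
    "How do I sort a list in Python",
    "How to sort a Python list",
)
```

Each feature can then be plotted against the duplicate label to see how well it separates similar from dissimilar pairs.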

Is difference in question length a good indicator of how similar (True) or dissimilar (False) two questions are? Short answer: No.

Analyzing the distribution of questions in the dataset. Most questions are unique (# occurrences = 1).
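A distribution like this can be computed by counting how often each question appears and then counting the counts. The sketch below assumes each question is represented by an ID in a flat list; the function name `occurrence_distribution` and the toy IDs are made up for illustration.

```python
from collections import Counter


def occurrence_distribution(question_ids):
    # First pass: how many times does each question appear?
    per_question = Counter(question_ids)
    # Second pass: how many questions appear exactly n times?
    return Counter(per_question.values())


# Toy data: questions 1 and 3 appear once, 2 appears twice, 4 three times.
distribution = occurrence_distribution([1, 2, 2, 3, 4, 4, 4])
```

The resulting mapping goes from number of occurrences to number of questions, which is the shape needed for a histogram.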