Text Similarity Analysis Implementation and Testing

Project Information

Course Repository Link
Software Engineering Project Repo
Objective: Build a basic engineering project

Personal Software Process (PSP) Summary

Stage        Task                           Estimated (min)  Actual (min)
Planning     Estimation                     30               30
Development  Core Implementation            720              805
             Analysis & Learning            120              150
             Documentation                  60               60
             Review                         45               45
             Standards                      45               50
             Design                         60               70
             Coding                         180              200
             Code Review                    30               45
             Testing                        180              185
Reporting    Documentation & Retrospective  80               90
             Test Report                    60               70
             Size Measurement               10               15
             Process Improvement            10               5
Total                                       830              925

Interface Design and Implementation

The core logic relies on calculating Cosine Similarity between text vectors. The implementation involves tokenizing input strings, merging vocabulary sets, and computing dot products and magnitudes to determine the similarity score.
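For reference, the standard definition of cosine similarity for two term-frequency vectors A and B over a shared vocabulary of n terms is given below, followed by a worked instance using the proportional vectors (3, 3, 3) and (9, 9, 9) from the Cosine Similarity Module tests in this post. The symbols A, B, and n are notation chosen here, not names taken from the project.

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]

% Worked example: A = (3, 3, 3), B = (9, 9, 9)
\[
A \cdot B = 3 \cdot 9 + 3 \cdot 9 + 3 \cdot 9 = 81, \qquad
\lVert A \rVert = \sqrt{27}, \qquad
\lVert B \rVert = \sqrt{243}
\]
\[
\cos\theta = \frac{81}{\sqrt{27 \cdot 243}} = \frac{81}{\sqrt{6561}} = \frac{81}{81} = 1
\]
```

Because only the direction of the vectors matters, scaling every term count by the same factor leaves the score unchanged, which is why the proportional-vector test expects 1.0.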

Performance Analysis

Profiling indicates that the most frequent operations are byte[] manipulation and String processing, particularly during the file reading and tokenization stages.

Unit Testing Suite

1. Tokenization Module

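// Imports assume JUnit 4; with JUnit 5, use org.junit.jupiter.api.Test
// and org.junit.jupiter.api.Assertions instead.
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;

import org.junit.Test;

public class TokenizerTest {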
    @Test
    public void testComplexSentence() {
        String input = "In the dark attic, on the moonbed, inside the turtle's dream.";
        List<String> predicted = Arrays.asList("dark", "attic", "moonbed", "inside", "turtle", "dream");
        List<String> result = TextTokenizer.tokenize(input);
        assertEquals(predicted, result);
    }

    @Test
    public void testPunctuationRemoval() {
        String input = "Ah! Ah! Ah!";
        List<String> predicted = Arrays.asList("Ah", "Ah", "Ah");
        List<String> result = TextTokenizer.tokenize(input);
        assertEquals(predicted, result);
    }

    @Test
    public void testVocabularyMerge() {
        List<String> first = Arrays.asList("apple", "banana");
        List<String> second = Arrays.asList("cherry", "date");
        List<String> predicted = Arrays.asList("apple", "banana", "cherry", "date");
        List<String> result = VocabularyUtil.merge(first, second);
        assertEquals(predicted, result);
    }
}
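The post does not show TextTokenizer or VocabularyUtil themselves, so the following is only a minimal sketch of one implementation that would satisfy the three tests above. The class name TokenizerSketch, the three-word stop list, and the possessive-stripping rule are all assumptions inferred from the expected outputs, not the author's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TokenizerSketch {
    // Hypothetical stop-word list: just enough to reproduce the test expectations.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("in", "the", "on"));

    /** Strips possessive 's, splits on punctuation/whitespace, drops stop words, keeps case. */
    public static List<String> tokenize(String text) {
        // Remove the possessive suffix first so "turtle's" becomes "turtle".
        String cleaned = text.replaceAll("'s\\b", "");
        List<String> tokens = new ArrayList<>();
        for (String word : cleaned.split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue; // a leading separator produces one empty piece
            }
            if (STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                continue; // stop words are matched case-insensitively
            }
            tokens.add(word);
        }
        return tokens;
    }

    /** Merges two token lists into one vocabulary: first-seen order, no duplicates. */
    public static List<String> merge(List<String> first, List<String> second) {
        Set<String> vocabulary = new LinkedHashSet<>(first);
        vocabulary.addAll(second);
        return new ArrayList<>(vocabulary);
    }
}
```

A LinkedHashSet is used in merge so that the vocabulary order is deterministic, which the list-equality assertion in testVocabularyMerge depends on.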

2. Frequency Calculation Module

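// Imports assume JUnit 4; with JUnit 5, use org.junit.jupiter.api.Test
// and org.junit.jupiter.api.Assertions instead.
import static org.junit.Assert.assertArrayEquals;

import java.util.Arrays;
import java.util.List;

import org.junit.Test;

public class FrequencyCalculatorTest {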
    @Test
    public void testWeightedFrequency() {
        // Input where item count equals its value: 5 appears 5 times, 4 appears 4 times, etc.
        List<String> source = Arrays.asList(
            "5", "5", "5", "5", "5",
            "4", "4", "4", "4",
            "3", "3", "3",
            "2", "2",
            "1"
        );
        List<String> vocab = Arrays.asList("0", "1", "2", "3", "4", "5");
        int[] predicted = {0, 1, 2, 3, 4, 5};
        int[] result = FrequencyAnalyzer.compute(source, vocab);
        assertArrayEquals(predicted, result);
    }

    @Test
    public void testEmptySource() {
        List<String> source = Arrays.asList();
        List<String> vocab = Arrays.asList("0", "1", "2");
        int[] predicted = {0, 0, 0};
        int[] result = FrequencyAnalyzer.compute(source, vocab);
        assertArrayEquals(predicted, result);
    }

    @Test
    public void testEmptyVocabulary() {
        List<String> source = Arrays.asList("1", "2");
        List<String> vocab = Arrays.asList();
        int[] predicted = {};
        int[] result = FrequencyAnalyzer.compute(source, vocab);
        assertArrayEquals(predicted, result);
    }
}
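As an illustration of what these tests expect from FrequencyAnalyzer.compute, here is a minimal sketch under the assumption that the method returns one count per vocabulary entry, in vocabulary order. The class name FrequencySketch is hypothetical, and the real implementation may differ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FrequencySketch {
    /** Returns, for each vocabulary term, how many times it occurs in source. */
    public static int[] compute(List<String> source, List<String> vocab) {
        // Count every token once up front instead of rescanning source per term.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : source) {
            counts.merge(token, 1, Integer::sum);
        }
        // Project the counts onto the vocabulary; unseen terms get 0.
        int[] vector = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            vector[i] = counts.getOrDefault(vocab.get(i), 0);
        }
        return vector;
    }
}
```

Counting into a map first keeps the work proportional to the size of the source plus the size of the vocabulary, and both empty-input cases fall out naturally: an empty source yields all zeros, and an empty vocabulary yields a zero-length array.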

3. Cosine Similarity Module

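// Imports assume JUnit 4; with JUnit 5, use org.junit.jupiter.api.Test
// and org.junit.jupiter.api.Assertions instead.
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class SimilarityMetricTest {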
    @Test
    public void testIdenticalVectors() {
        int[] vA = {5, 10, 15};
        int[] vB = {5, 10, 15};
        double score = SimilarityMetric.calculate(vA, vB);
        assertEquals(1.0, score, 0.0001);
    }

    @Test
    public void testOrthogonalVectors() {
        int[] vA = {1, 0, 0};
        int[] vB = {0, 2, 0};
        double score = SimilarityMetric.calculate(vA, vB);
        assertEquals(0.0, score, 0.0001);
    }

    @Test
    public void testOppositeDirection() {
        int[] vA = {2, 4, 6};
        int[] vB = {-2, -4, -6};
        double score = SimilarityMetric.calculate(vA, vB);
        assertEquals(-1.0, score, 0.0001);
    }

    @Test
    public void testProportionalVectors() {
        int[] vA = {3, 3, 3};
        int[] vB = {9, 9, 9};
        double score = SimilarityMetric.calculate(vA, vB);
        assertEquals(1.0, score, 0.0001);
    }

    @Test
    public void testZeroVector() {
        int[] vA = {0, 0, 0};
        int[] vB = {1, 5, 9};
        double score = SimilarityMetric.calculate(vA, vB);
        assertEquals(0.0, score, 0.0001);
    }
}
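The five cases above pin down the behavior of SimilarityMetric.calculate fairly tightly, including the convention that a zero vector scores 0.0 rather than producing NaN. A minimal sketch consistent with them might look like the following; the class name CosineSketch and the length check are assumptions, not part of the original project.

```java
public class CosineSketch {
    /** Cosine similarity of two equal-length integer vectors; 0.0 if either is all zeros. */
    public static double calculate(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        // Accumulate in long to avoid int overflow on large term counts.
        long dot = 0;
        long squaredNormA = 0;
        long squaredNormB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (long) a[i] * b[i];
            squaredNormA += (long) a[i] * a[i];
            squaredNormB += (long) b[i] * b[i];
        }
        // A zero vector has no direction; return 0.0 instead of dividing by zero.
        if (squaredNormA == 0 || squaredNormB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(squaredNormA) * Math.sqrt(squaredNormB));
    }
}
```

Term-frequency vectors are never negative in practice, so the opposite-direction case only exercises the math; real documents would score between 0.0 and 1.0.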

4. Test Coverage

The unit test suite achieves a high percentage of code coverage, ensuring critical paths in tokenization, frequency analysis, and vector mathematics are verified.

Exception Handling Strategy

  • Argument Parsing: Handles cases where command-line arguments are missing or malformed.
  • File I/O: Manages exceptions related to missing input files or permissions issues.
  • Output Operations: Catches errors occurring during the writing of results to the file system.
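The three categories above can be sketched as separate stages, each with its own error message and exit code. Everything specific here is an assumption made for illustration: the class name CliSketch, the three-argument command line (two input files and one output file), the exit codes, and the placeholder score. The post does not show the project's actual entry point.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

public class CliSketch {
    /** Returns 0 on success, 1 for bad arguments, 2 for read errors, 3 for write errors. */
    static int run(String[] args) {
        // Stage 1: argument parsing (missing or malformed arguments).
        if (args == null || args.length != 3) {
            System.err.println("Usage: <original-file> <candidate-file> <output-file>");
            return 1;
        }
        Path original, candidate, output;
        try {
            original = Paths.get(args[0]);
            candidate = Paths.get(args[1]);
            output = Paths.get(args[2]);
        } catch (InvalidPathException e) {
            System.err.println("Malformed path argument: " + e.getInput());
            return 1;
        }

        // Stage 2: file input (missing files, permission problems).
        String originalText, candidateText;
        try {
            originalText = new String(Files.readAllBytes(original), StandardCharsets.UTF_8);
            candidateText = new String(Files.readAllBytes(candidate), StandardCharsets.UTF_8);
        } catch (NoSuchFileException e) {
            System.err.println("Input file not found: " + e.getFile());
            return 2;
        } catch (IOException e) { // also covers AccessDeniedException
            System.err.println("Could not read input: " + e.getMessage());
            return 2;
        }

        // Placeholder only: the real similarity computation on the two texts goes here.
        double score = 0.0;

        // Stage 3: output (errors while writing the result).
        try {
            String line = String.format(Locale.ROOT, "%.2f%n", score);
            Files.write(output, line.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            System.err.println("Could not write result: " + e.getMessage());
            return 3;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.exit(run(args));
    }
}
```

Returning a distinct code per stage, rather than letting a stack trace escape, makes it easier for a grading script or shell pipeline to tell which kind of failure occurred.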

Tags: java Software Engineering Unit Testing Cosine Similarity Natural Language Processing

Posted on Fri, 08 May 2026 10:30:38 +0000 by amarquis