Text Similarity Analysis Implementation and Testing

Project Information

Course	Software Engineering
Repository Link	Project Repo
Objective	Build a basic engineering project

Personal Software Process (PSP) Summary

Stage	Task Description	Estimated (Min)	Actual (Min)
Planning	Estimation	30	30
Development	Core Implementation	720	805
	Analysis & Learning	120	150
	Documentation	60	60
	Review	45	45
	Standards	45	50
	Design	60	70
	Coding	180	200
	Code Review	30	45
	Testing	180	185
Reporting	Documentation & Retrospective	80	90
	Test Report	60	70
	Size Measurement	10	15
	Process Improvement	10	5
Total		830	925

Interface Design and Implementation

The core logic computes the cosine similarity between two term-frequency vectors. Each input string is tokenized, the two token lists are merged into a shared vocabulary, each text is counted against that vocabulary, and the similarity score is the dot product of the two count vectors divided by the product of their magnitudes.
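The final step, cos(A, B) = (A · B) / (|A| × |B|), can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name CosineSketch and the method name cosine are hypothetical, and returning 0.0 for a zero-magnitude vector is an assumption chosen to match the zero-vector test later in this post.

```java
// Hypothetical sketch: class and method names are illustrative, not the project's own.
final class CosineSketch {
    private CosineSketch() {}

    /** Returns dot(a, b) / (|a| * |b|), or 0.0 when either vector has zero magnitude. */
    static double cosine(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        // Accumulate in long so large counts do not overflow int arithmetic.
        long dot = 0;
        long sumSqA = 0;
        long sumSqB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (long) a[i] * b[i];
            sumSqA += (long) a[i] * a[i];
            sumSqB += (long) b[i] * b[i];
        }
        // Assumption: a zero vector is treated as "no similarity" to avoid dividing by zero.
        if (sumSqA == 0 || sumSqB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(sumSqA) * Math.sqrt(sumSqB));
    }
}
```

Because the score depends only on direction, proportional vectors such as {3, 3, 3} and {9, 9, 9} score 1.0 (within floating-point tolerance), which is why a long document and a short one with the same word distribution are treated as identical.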

Performance Analysis

Profiling indicates that the most frequent operations are byte[] manipulation and String processing, concentrated in the file reading and tokenization stages.

Unit Testing Suite

1. Tokenization Module

// Imports assume JUnit 4; adjust them if the project uses JUnit 5.
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;

import org.junit.Test;

public class TokenizerTest {
    // The expected list omits "In", "the", "on" and the possessive "'s".
    @Test
    public void testComplexSentence() {
        String input = "In the dark attic, on the moonbed, inside the turtle's dream.";
        List<String> predicted = Arrays.asList("dark", "attic", "moonbed", "inside", "turtle", "dream");
        List<String> result = TextTokenizer.tokenize(input);
        assertEquals(predicted, result);
    }

    // Punctuation is stripped; repeated tokens and their original case are kept.
    @Test
    public void testPunctuationRemoval() {
        String input = "Ah! Ah! Ah!";
        List<String> predicted = Arrays.asList("Ah", "Ah", "Ah");
        List<String> result = TextTokenizer.tokenize(input);
        assertEquals(predicted, result);
    }

    // Merging two disjoint token lists preserves first-seen order.
    @Test
    public void testVocabularyMerge() {
        List<String> first = Arrays.asList("apple", "banana");
        List<String> second = Arrays.asList("cherry", "date");
        List<String> predicted = Arrays.asList("apple", "banana", "cherry", "date");
        List<String> result = VocabularyUtil.merge(first, second);
        assertEquals(predicted, result);
    }
}
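For illustration, a tokenizer and merge helper that would satisfy the three tests above might look like the sketch below. Everything here is an assumption rather than the project's real code: the class name TokenizeSketch, the three-word stop list (inferred only from the words missing in the expected output), the rule that strips a trailing "'s", and the choice to deduplicate the merged vocabulary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: names and rules are illustrative assumptions.
final class TokenizeSketch {
    // Assumption: a tiny stop-word list inferred from the test; the real list is not shown.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("in", "the", "on"));

    private TokenizeSketch() {}

    /** Splits text into words, dropping punctuation, possessive "'s", and stop words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that is not a letter, digit, or apostrophe.
        for (String raw : text.split("[^\\p{L}\\p{N}']+")) {
            String word = raw.endsWith("'s") ? raw.substring(0, raw.length() - 2) : raw;
            word = word.replace("'", "");
            // Stop words are matched case-insensitively; kept tokens retain their case.
            if (!word.isEmpty() && !STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    /** Merges two token lists into one vocabulary, keeping first-seen order without duplicates. */
    static List<String> merge(List<String> first, List<String> second) {
        Set<String> merged = new LinkedHashSet<>(first);
        merged.addAll(second);
        return new ArrayList<>(merged);
    }
}
```

A LinkedHashSet is used for the merge so that each word gets one stable index in the vocabulary; the frequency vectors of both texts must share the same index order for the dot product to be meaningful.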

2. Frequency Calculation Module

// Imports assume JUnit 4; adjust them if the project uses JUnit 5.
import static org.junit.Assert.assertArrayEquals;

import java.util.Arrays;
import java.util.List;

import org.junit.Test;

public class FrequencyCalculatorTest {
    @Test
    public void testWeightedFrequency() {
        // Each item appears as many times as its value: "5" five times, "4" four times, etc.
        List<String> source = Arrays.asList(
            "5", "5", "5", "5", "5",
            "4", "4", "4", "4",
            "3", "3", "3",
            "2", "2",
            "1"
        );
        // "0" is in the vocabulary but never in the source, so its count must be 0.
        List<String> vocab = Arrays.asList("0", "1", "2", "3", "4", "5");
        int[] predicted = {0, 1, 2, 3, 4, 5};
        int[] result = FrequencyAnalyzer.compute(source, vocab);
        assertArrayEquals(predicted, result);
    }

    // An empty source yields a zero count for every vocabulary entry.
    @Test
    public void testEmptySource() {
        List<String> source = Arrays.asList();
        List<String> vocab = Arrays.asList("0", "1", "2");
        int[] predicted = {0, 0, 0};
        int[] result = FrequencyAnalyzer.compute(source, vocab);
        assertArrayEquals(predicted, result);
    }

    // An empty vocabulary yields an empty vector, whatever the source contains.
    @Test
    public void testEmptyVocabulary() {
        List<String> source = Arrays.asList("1", "2");
        List<String> vocab = Arrays.asList();
        int[] predicted = {};
        int[] result = FrequencyAnalyzer.compute(source, vocab);
        assertArrayEquals(predicted, result);
    }
}
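The behavior these tests describe, one count per vocabulary entry in vocabulary order, could be implemented roughly as below. This is a sketch under assumptions: FrequencySketch is a hypothetical name, and counting the source once into a map (rather than rescanning it for every vocabulary word) is a design choice of this example, not something the post states.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the class name and approach are illustrative assumptions.
final class FrequencySketch {
    private FrequencySketch() {}

    /** Returns, for each vocabulary entry in order, how often it occurs in source. */
    static int[] compute(List<String> source, List<String> vocab) {
        // One pass over the source builds token -> occurrence count.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : source) {
            counts.merge(token, 1, Integer::sum);
        }
        // One pass over the vocabulary reads the counts out in vocabulary order.
        int[] vector = new int[vocab.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = counts.getOrDefault(vocab.get(i), 0);
        }
        return vector;
    }
}
```

Tokens in the source that are missing from the vocabulary are simply ignored, and vocabulary entries absent from the source get a count of 0.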

3. Cosine Similarity Module

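// Imports assume JUnit 4; adjust them if the project uses JUnit 5.
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class SimilarityMetricTest {
    // Identical vectors point in the same direction, so the score is 1.
    @Test
    public void testIdenticalVectors() {
        int[] vA = {5, 10, 15};
        int[] vB = {5, 10, 15};
        double score = SimilarityMetric.calculate(vA, vB);
        assertEquals(1.0, score, 0.0001);
    }

    // Vectors with no shared non-zero component have a dot product of 0.
    @Test
    public void testOrthogonalVectors() {
        int[] vA = {1, 0, 0};
        int[] vB = {0, 2, 0};
        double score = SimilarityMetric.calculate(vA, vB);
        assertEquals(0.0, score, 0.0001);
    }

    // Negated vectors score -1; word counts are never negative, so this checks the math only.
    @Test
    public void testOppositeDirection() {
        int[] vA = {2, 4, 6};
        int[] vB = {-2, -4, -6};
        double score = SimilarityMetric.calculate(vA, vB);
        assertEquals(-1.0, score, 0.0001);
    }

    // Scaling a vector does not change its direction, so the score stays 1.
    @Test
    public void testProportionalVectors() {
        int[] vA = {3, 3, 3};
        int[] vB = {9, 9, 9};
        double score = SimilarityMetric.calculate(vA, vB);
        assertEquals(1.0, score, 0.0001);
    }

    // Cosine similarity is undefined for a zero vector; the implementation is expected to return 0.
    @Test
    public void testZeroVector() {
        int[] vA = {0, 0, 0};
        int[] vB = {1, 5, 9};
        double score = SimilarityMetric.calculate(vA, vB);
        assertEquals(0.0, score, 0.0001);
    }
}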

4. Test Coverage

The unit test suite achieves high code coverage and exercises the critical paths in tokenization, frequency analysis, and vector arithmetic.

Exception Handling Strategy

Argument Parsing: Handles cases where command-line arguments are missing or malformed.
File I/O: Manages exceptions related to missing input files or permissions issues.
Output Operations: Catches errors occurring during the writing of results to the file system.
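
The three cases above could be handled along the lines of the following sketch. It is illustrative only: the class name CliSketch, the exit codes, the three-argument layout (original file, candidate file, output file), and the placeholder report line are all assumptions, since the post does not show the project's entry point.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: names, exit codes, and argument order are assumptions.
final class CliSketch {
    static final int OK = 0;
    static final int BAD_ARGS = 1;
    static final int READ_ERROR = 2;
    static final int WRITE_ERROR = 3;

    private CliSketch() {}

    static int run(String[] args) {
        // Argument parsing: this sketch expects exactly three path arguments.
        if (args == null || args.length != 3) {
            System.err.println("Usage: <original-file> <candidate-file> <output-file>");
            return BAD_ARGS;
        }
        Path original;
        Path candidate;
        Path output;
        try {
            original = Paths.get(args[0]);
            candidate = Paths.get(args[1]);
            output = Paths.get(args[2]);
        } catch (InvalidPathException e) {
            System.err.println("Malformed path argument: " + e.getInput());
            return BAD_ARGS;
        }

        // File input: missing files and permission problems both surface as IOException.
        String originalText;
        String candidateText;
        try {
            originalText = new String(Files.readAllBytes(original), StandardCharsets.UTF_8);
            candidateText = new String(Files.readAllBytes(candidate), StandardCharsets.UTF_8);
        } catch (NoSuchFileException e) {
            System.err.println("Input file not found: " + e.getFile());
            return READ_ERROR;
        } catch (IOException e) {
            System.err.println("Could not read input: " + e.getMessage());
            return READ_ERROR;
        }

        // Placeholder result; a real run would write the similarity score here.
        String report = String.format("original=%d chars, candidate=%d chars%n",
                originalText.length(), candidateText.length());

        // Output: failures while writing the result are reported separately.
        try {
            Files.write(output, report.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            System.err.println("Could not write result: " + e.getMessage());
            return WRITE_ERROR;
        }
        return OK;
    }
}
```

Returning a distinct code per failure category, instead of letting a stack trace escape, lets a calling script tell a usage mistake from a missing file or an unwritable output path.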

Tags: java Software Engineering Unit Testing Cosine Similarity Natural Language Processing

Posted on Fri, 08 May 2026 10:30:38 +0000 by amarquis
