Building a Simple Sentiment Classification Model in Keras
To demonstrate how text can be processed and classified using neural networks, we construct a basic binary sentiment model that distinguishes between positive and negative phrases. The implementation uses Keras to encode the text as integers, pad the sequences to a common length, and build a small network consisting of an embedding layer followed by a dense classifier.
Data Preparation
We begin by defining a small corpus of textual samples along with their corresponding labels—1 for positive sentiment and 0 for negative:
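import numpy as np

# Ten short phrases: the first five are positive, the last five negative
docs = [
    'Well done!',
    'Good work',
    'Great effort',
    'nice work',
    'Excellent!',
    'Weak',
    'Poor effort!',
    'not good',
    'poor work',
    'Could have done better.'
]
# 1 = positive, 0 = negative, in the same order as docs
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])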
Text Encoding with One-Hot Representation
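Each document is converted into a sequence of integers with Keras's one_hot function. Despite its name, one_hot does not produce one-hot vectors: it hashes each word to an integer index in the range [1, vocab_size). Because hashing can map different words to the same index, we set the vocabulary size to 50, well above the 14 distinct words in the corpus, to make collisions less likely: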
from tensorflow.keras.preprocessing.text import one_hot  # deprecated in recent releases

vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
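To make the hashing behavior concrete, here is a minimal pure-Python sketch of the idea behind one_hot. It is an illustration rather than Keras's implementation: the helper name hash_encode, the simplified punctuation handling, and the choice of MD5 as a stable hash are all assumptions made for this example.

```python
import hashlib
import string

def hash_encode(text, n):
    """Map each word of `text` to an integer in [1, n - 1] (illustrative helper)."""
    # Lowercase, drop ASCII punctuation, then split on whitespace
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    indices = []
    for word in cleaned.split():
        digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        # Index 0 is never produced, so it stays free for padding
        indices.append(digest % (n - 1) + 1)
    return indices

samples = ["Good work", "nice work", "Could have done better."]
encoded = [hash_encode(s, 50) for s in samples]
print(encoded)
```

The same word always receives the same index ("work" in the first two phrases), but two different words may also share an index, which is why a vocabulary size larger than the real vocabulary is chosen.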
Sequence Padding
Since input sequences vary in length, they must be standardized. We use pad_sequences to ensure all sequences have the same length (in this case, 4). Shorter sequences are padded with zeros at the end (padding='post'):
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
Understanding pad_sequences Parameters
The function signature is:
tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=None, dtype='int32',
    padding='pre', truncating='pre', value=0.0
)
- padding: Where to add padding — 'pre' adds before, 'post' adds after.
- truncating: Which part to remove if the sequence exceeds maxlen — 'pre' removes from the start, 'post' from the end.
- maxlen: Length of every output sequence. If None, the length of the longest input sequence is used.
Example usage:
x = [[3], [5, 6], [7, 8, 9]]
print(pad_sequences(x))  # Default: pad at front, result shape (3, 3)
print(pad_sequences(x, maxlen=2))  # Truncate to length 2, dropping values from the start
print(pad_sequences(x, padding='post'))  # Pad at the end
print(pad_sequences(x, maxlen=2, truncating='post'))  # Truncate to length 2, dropping values from the end
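The padding and truncation rules can also be written out directly in NumPy. The function below, pad_sketch, is a hypothetical re-implementation written for this explanation (it assumes integer sequences and maxlen of at least 1); it is not the Keras source, but it follows the behavior described by the parameters above.

```python
import numpy as np

def pad_sketch(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Illustrative stand-in for pad_sequences on lists of integers."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence becomes a row of padding
        # 'pre' keeps the last maxlen items, 'post' keeps the first maxlen items
        kept = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        if padding == "post":
            out[i, :len(kept)] = kept
        else:
            out[i, -len(kept):] = kept
    return out

x = [[3], [5, 6], [7, 8, 9]]
print(pad_sketch(x))                                # [[0 0 3] [0 5 6] [7 8 9]]
print(pad_sketch(x, maxlen=2))                      # [[0 3] [5 6] [8 9]]
print(pad_sketch(x, padding="post"))                # [[3 0 0] [5 6 0] [7 8 9]]
print(pad_sketch(x, maxlen=2, truncating="post"))   # [[0 3] [5 6] [7 8]]
```

Note that truncating only has an effect when a sequence is longer than maxlen; with maxlen left at None nothing is ever cut.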
Model Architecture
We define a model using the Functional API. It includes:
- An Embedding layer that transforms integer-encoded words into dense 8-dimensional vectors.
- A Flatten layer to convert the 2D output (4 time steps × 8 dimensions) into a 32-element vector.
- A final Dense layer with sigmoid activation for binary classification.
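from tensorflow.keras.layers import Input, Embedding, Flatten, Dense
from tensorflow.keras.models import Model

def build_model():
    inputs = Input(shape=(max_length,))
    # Map each word index to a trainable 8-dimensional vector.
    # input_length is omitted: the Input layer already fixes the sequence
    # length, and newer Keras versions deprecate the argument.
    embedded = Embedding(vocab_size, 8)(inputs)
    flattened = Flatten()(embedded)  # (4, 8) -> 32
    outputs = Dense(1, activation='sigmoid')(flattened)
    model = Model(inputs, outputs)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    return model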
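To see what these three layers compute, the forward pass can be sketched in plain NumPy. This is a rough illustration under stated assumptions: the weights are random stand-ins rather than trained values, the input indices are made up, and the names embedding_matrix, dense_w, dense_b, and forward are invented for the example.

```python
import numpy as np

vocab_size, embed_dim, max_length = 50, 8, 4
rng = np.random.default_rng(0)

# Randomly initialised stand-ins for the trainable weights
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))  # 50 * 8 = 400 parameters
dense_w = rng.normal(size=(max_length * embed_dim, 1))       # 32 weights
dense_b = np.zeros(1)                                        # 1 bias

def forward(batch):
    embedded = embedding_matrix[batch]            # lookup: (batch, 4, 8)
    flattened = embedded.reshape(len(batch), -1)  # flatten: (batch, 32)
    logits = flattened @ dense_w + dense_b        # dense:   (batch, 1)
    return 1.0 / (1.0 + np.exp(-logits))          # sigmoid -> values in (0, 1)

batch = np.array([[12, 7, 0, 0], [3, 41, 9, 0]])  # made-up padded index sequences
probs = forward(batch)
n_params = embedding_matrix.size + dense_w.size + dense_b.size
print(probs.shape, n_params)  # (2, 1) 433
```

The count of 433 trainable parameters (400 for the embedding, 33 for the dense layer) is what model.summary() should report for a network of this shape.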
Training and Evaluation
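The model is trained on the entire dataset for 100 epochs. Because it is then evaluated on the same ten samples it was trained on, the reported accuracy shows how well the model fits the training data, not how well it generalizes: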
model = build_model()
model.fit(padded_docs, labels, epochs=100, verbose=0)
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print(f'Accuracy: {accuracy * 100:.2f}%')
Prediction Example
We test the trained model on a new phrase:
test_word = one_hot('good', vocab_size)
padded_test = pad_sequences([test_word], maxlen=max_length, padding='post')
prediction = model.predict(padded_test)
print(prediction)
This returns an array of shape (1, 1) holding the sigmoid output, which can be read as the model's estimated probability that the input is positive.
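If a hard class label is needed rather than a probability, a common convention is to threshold the score at 0.5. The snippet below is a small sketch of that step; the probability values are hypothetical placeholders, not output from the model above.

```python
import numpy as np

# Hypothetical model.predict output for three inputs, shape (3, 1)
prediction = np.array([[0.91], [0.48], [0.07]])

# Scores of 0.5 or more are treated as positive (1), the rest as negative (0)
predicted_labels = (prediction >= 0.5).astype(int).ravel()
names = ["negative", "positive"]
print([names[i] for i in predicted_labels])  # ['positive', 'negative', 'negative']
```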