Text Clustering


Clustering is the process of grouping similar items together, so that items in one group are more similar to each other than to items in other groups.
Text clustering is the process of grouping a set of unlabelled texts so that similar texts fall into the same group.

Text clustering steps

Step name	Details
Text pre-processing	It involves tokenization, transformation, normalization and filtering.
Feature extraction	It extracts features (at the word, token, document or corpus level) from the textual data; those features are then used to cluster the text documents.
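
As an illustration of the feature extraction step, the sketch below turns tokenized documents into weighted term vectors. TF-IDF weighting is an assumption made here for the example; these notes do not name a specific weighting scheme, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical, already tokenized documents.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "chased", "mouse"],
    ["stock", "market", "fell"],
]
vectors = tf_idf(docs)
print(vectors[0])
```

Terms that occur in fewer documents (such as "sat") receive a higher weight than terms shared across documents (such as "cat"), which is what makes the vectors useful for telling documents apart.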

Keyword	Definition
Tokenization	It parses text data into smaller parts (tokens = words / phrases).
Transformation	It converts the text to lowercase.
Normalization	It transforms text into a canonical (root) form. The techniques for deriving the root word are stemming and lemmatization.
Filtering	Words that carry little meaning on their own (stop words) are removed from the texts before clustering.
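
The four pre-processing operations defined above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the stop-word list and the `crude_stem` helper are hypothetical toy stand-ins for a real stop-word list and a real stemmer or lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "of", "and", "to", "in"}

def crude_stem(token):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text)             # tokenization
    tokens = [t.lower() for t in tokens]                # transformation
    tokens = [crude_stem(t) for t in tokens]            # normalization
    return [t for t in tokens if t not in STOP_WORDS]   # filtering

print(preprocess("Clustering groups the documents"))
# ['cluster', 'group', 'document']
```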

3 levels of text clustering

Word level clustering	It is a process used to group words by collecting synonyms for a particular word.
Sentence level clustering	It is a process used to group sentences from different documents. Example: Twitter analysis.
Document level clustering	It is a process used to group documents based on a topic. Examples: emails, search engines, etc.

Text clustering similarity measures
Words can be checked for 2 types of similarity: lexical similarity and semantic similarity.

Lexical similarity	Words are said to be lexically similar if they have a similar character sequence. It is measured using string-based algorithms.
Semantic similarity	Words are said to be semantically similar if they are related in meaning, for example by having the same meaning or by being opposites of each other. It is measured using knowledge-based algorithms.
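
As an example of a string-based measure for lexical similarity, the sketch below computes the Levenshtein edit distance and scales it to a score between 0 and 1. Levenshtein distance is one choice among several string-based algorithms and is an assumption here; the helper name `lexical_similarity` is hypothetical.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def lexical_similarity(a, b):
    """Scale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

print(levenshtein("kitten", "sitting"))            # 3
print(lexical_similarity("cluster", "clusters"))   # 0.875
```

A semantic measure cannot be computed from the characters alone; it needs an external knowledge source such as a thesaurus, which is why it is not sketched here.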

Text Clustering Algorithms
Hierarchical
It is of 2 types:
a. Divisive approach.
b. Agglomerative approach.

Divisive approach
It starts with a single cluster containing all the data and repeatedly splits it into sub-clusters (top-down).
Example algorithms: DIANA and MONA.

Agglomerative approach
It starts with each item in its own small cluster and repeatedly merges the closest clusters to form bigger clusters (bottom-up).
Example algorithms: BIRCH and CURE.
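
The bottom-up merging idea can be sketched as follows. This is plain single-linkage agglomerative clustering on small numeric vectors, chosen for brevity; it is not an implementation of BIRCH or CURE, and the function names and sample points are hypothetical.

```python
def euclidean(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    while len(clusters) > max(n_clusters, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the two closest clusters
        del clusters[b]
    return clusters

# Hypothetical 2-D feature vectors standing in for documents.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.0)]
print(agglomerative(points, 3))
```

A divisive algorithm would run in the opposite direction, starting from one cluster holding all five points and splitting until the requested number of clusters is reached.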

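Partitioning
It divides the data into a fixed number of non-overlapping clusters and iteratively reassigns items to improve the partition.
Example algorithms: k-means, ISODATA and PAM.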
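
A minimal sketch of k-means (Lloyd's algorithm) on small numeric vectors is shown below. Picking the initial centroids at random from the data with a fixed seed is an assumption of this example, and the function name `k_means` and the sample points are hypothetical.

```python
import random

def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as initial centroids
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2
                                            for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, labels

# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = k_means(points, k=2)
print(labels)
```

Unlike the hierarchical approaches, the number of clusters k must be chosen before the algorithm runs.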

Density
Clusters are formed based on how many data points fall within a given radius. Example algorithm: DBSCAN.
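
The radius-and-count idea can be sketched with a compact version of DBSCAN. The parameter names `eps` (the radius) and `min_pts` (the minimum number of points within that radius) follow common usage but are assumptions here, as are the sample points.

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbours(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if sum((x - y) ** 2 for x, y in zip(points[i], q)) <= eps ** 2]

    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of the current cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:  # j is a core point: expand
                queue.extend(j_neighbours)
    return labels

# Hypothetical 2-D feature vectors: two dense groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
# [0, 0, 0, 1, 1, 1, -1]
```

The number of clusters is not given in advance; it follows from the density of the data, and isolated points are reported as noise.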

Graph
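Documents are treated as nodes of a graph whose edges represent the similarity between documents; clusters are groups of strongly connected documents.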
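
One simple way to illustrate the graph view is to link every pair of documents whose similarity reaches a threshold and then report the connected components as clusters. The Jaccard measure, the threshold value and the function names are assumptions made for this sketch; graph-based methods in practice often use more elaborate graph partitioning.

```python
def jaccard(a, b):
    """Share of distinct tokens that two documents have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def graph_clusters(docs, threshold):
    """Link documents with similarity >= threshold; return connected components."""
    n = len(docs)
    edges = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(docs[i], docs[j]) >= threshold:
                edges[i].add(j)
                edges[j].add(i)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search collects every node reachable from start.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in edges[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        components.append(sorted(component))
    return components

# Hypothetical, already tokenized documents.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "sat", "rug"],
    ["stock", "market", "fell"],
    ["stock", "market", "rose"],
]
print(graph_clusters(docs, threshold=0.4))
# [[0, 1], [2, 3]]
```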

Probabilistic
Here, words are assigned probabilities of belonging to topics, and those probabilities are used to assign documents to clusters.
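
The sketch below shows how word probabilities per topic can be turned into a probability that a document belongs to each topic. The topic names and the word probabilities are hypothetical and written by hand; a real probabilistic method would learn them from the corpus. A uniform prior over topics and a bag-of-words model are further assumptions of this example.

```python
import math

# Hypothetical, hand-written topic models: P(word | topic).
TOPICS = {
    "sports":  {"match": 0.4, "team": 0.4, "market": 0.1, "price": 0.1},
    "finance": {"match": 0.1, "team": 0.1, "market": 0.4, "price": 0.4},
}

def topic_posterior(tokens, topics=TOPICS, unseen=1e-6):
    """P(topic | document) under a uniform prior and a bag-of-words model."""
    log_scores = {
        # Sum of log P(word | topic); unknown words get a small probability.
        name: sum(math.log(word_probs.get(t, unseen)) for t in tokens)
        for name, word_probs in topics.items()
    }
    # Normalize in log space to avoid underflow on long documents.
    top = max(log_scores.values())
    weights = {name: math.exp(s - top) for name, s in log_scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

posterior = topic_posterior(["market", "price", "team"])
print(posterior)
```

The result is a soft assignment: the document receives a probability for every topic rather than a single hard cluster label.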
