DWDM \ Text Clustering
Clustering is the process of grouping similar items together, so that items in one group are more similar to each other than to items in other groups.
Text clustering applies this process to a set of unlabelled texts.
Text clustering steps
Step name | Details
Text pre-processing | Involves tokenization, transformation, normalization and filtering.
Feature extraction | Extracts features from the textual data (at word, token, document or corpus level); these features are then used to cluster the text documents.
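The feature extraction step can be sketched as follows. The original does not name a feature scheme; TF-IDF weighting is assumed here as one common choice, and the function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Turn tokenized documents into TF-IDF vectors over a shared vocabulary."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

# Hypothetical, already pre-processed documents.
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "chased", "dog"]]
vocabulary, vectors = tfidf_vectors(docs)
print(vocabulary)
```

Terms that occur in fewer documents get a larger weight, so in the third document "chased" outweighs "cat". Each row of `vectors` is one document, ready to be passed to a clustering algorithm.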
Keyword | Definition
Tokenization | Parses text data into smaller parts (tokens: words or phrases).
Transformation | Converts the text to lowercase.
Normalization | Transforms text into a canonical (root) form. The techniques for deriving root words are stemming and lemmatization.
Filtering | Removes words that carry little meaning (stop words) from the texts before clustering.
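The four pre-processing steps defined above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the stop-word list is a hypothetical, tiny one, and `normalize` is a crude suffix stripper standing in for a real stemmer or lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "on"}

def tokenize(text):
    """Tokenization: split raw text into word tokens."""
    return re.findall(r"[A-Za-z]+", text)

def transform(tokens):
    """Transformation: convert every token to lowercase."""
    return [t.lower() for t in tokens]

def normalize(token):
    """Normalization: crude suffix stripping (assumed stand-in for stemming)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def filter_stop_words(tokens):
    """Filtering: drop words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def preprocess(text):
    tokens = transform(tokenize(text))
    tokens = [normalize(t) for t in tokens]
    return filter_stop_words(tokens)

print(preprocess("The cats are chasing the dogs in the garden"))
# ['cat', 'chas', 'dog', 'garden']
```

The output "chas" shows why a toy stemmer is only an approximation: a lemmatizer would return "chase" instead.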
3 Levels of text clustering
Word-level clustering | Groups words, for example by collecting the synonyms of a particular word.
Sentence-level clustering | Groups sentences taken from different documents. Example: Twitter analysis.
Document-level clustering | Groups documents based on their topic. Examples: emails, search engines, etc.
Text clustering similarity measures
Words can be checked for two types of similarity: lexical or semantic.
Lexical similarity | Words are lexically similar if they have a similar character sequence. It is measured using string-based algorithms.
Semantic similarity | Words are semantically similar if they are related in meaning, for example if they mean the same thing or are opposites of each other. It is measured using knowledge-based algorithms.
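The two similarity types can be illustrated with a short sketch. Levenshtein edit distance is assumed here as a representative string-based measure (the original names no specific algorithm), and `SYNSETS` is a hypothetical hand-made synonym table standing in for a real knowledge base.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def lexical_similarity(a, b):
    """Scale the edit distance to a score between 0 (different) and 1 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# Hypothetical synonym sets; a real system would query a lexical knowledge base.
SYNSETS = [{"car", "automobile"}, {"big", "large"}]

def semantically_similar(a, b):
    """Knowledge-based check: do both words appear in the same synonym set?"""
    return a == b or any(a in s and b in s for s in SYNSETS)

print(levenshtein("kitten", "sitting"))         # 3
print(semantically_similar("car", "automobile"))  # True
```

"car" and "automobile" share few characters, so their lexical score is low even though the synonym table marks them as semantically similar. That is the distinction the table above draws.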
Text Clustering Algorithms
Hierarchical clustering algorithms
These are of two types:
a. Divisive approach
b. Agglomerative approach
Divisive approach
It starts with one cluster containing all the data and repeatedly splits it into sub-clusters.
Example algorithms: DIANA and MONA.
Agglomerative approach
It starts with each item in its own small cluster and repeatedly merges clusters to form bigger ones.
Example algorithms: BIRCH and CURE.
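The bottom-up merging idea can be sketched in a few lines. This is not BIRCH or CURE: it is a plain single-linkage procedure, assumed here for brevity, and the function name `agglomerative` and the sample points are hypothetical.

```python
import math

def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering; returns lists of point indices."""
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair of clusters
        del clusters[b]
    return clusters

# Hypothetical 2-D feature vectors for five documents.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.0)]
clusters = agglomerative(points, 3)
print(clusters)
```

A divisive algorithm runs the same loop in reverse, starting from one cluster and splitting it until the desired number of clusters is reached.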
Partitioning
The data is divided into k non-overlapping clusters, each item belonging to exactly one cluster.
Example algorithms: k-means, ISODATA and PAM.
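A minimal k-means sketch is shown below. One simplification is assumed: the initial centroids are picked by a farthest-first rule rather than at random, purely to keep the example deterministic. The names and sample points are hypothetical.

```python
import math

def kmeans(points, k, iterations=100):
    """Basic k-means; returns (labels, centroids)."""
    # Farthest-first initialisation (an assumption made for determinism).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D feature vectors for six documents.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, 2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```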
Density
Clusters are formed based on how many data points fall within a given radius.
Example algorithm: DBSCAN.
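The radius-based idea can be sketched as a compact DBSCAN. The parameter names `eps` (the radius) and `min_pts` (the minimum number of points within that radius) follow common usage but do not appear in the original, and the sample points are hypothetical.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # All points within radius eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # too few points nearby; may become a border point later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (50.0, 50.0)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, -1]
```

Unlike k-means, the number of clusters is not given in advance, and the isolated last point is reported as noise rather than forced into a cluster.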
Graph
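Documents are modelled as nodes of a graph whose edges represent document similarity; clusters are groups of strongly connected documents.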
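One simple way to sketch graph-based clustering is to link documents whose similarity exceeds a threshold and take the connected components as clusters. The use of cosine similarity, the threshold value, and connected components are all assumptions made for illustration; the sample documents are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def graph_clusters(docs, threshold):
    """Link similar documents, then return connected components as clusters."""
    bags = [Counter(doc.lower().split()) for doc in docs]
    n = len(docs)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(bags[i], bags[j]) >= threshold:
                adjacency[i].add(j)
                adjacency[j].add(i)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            component.append(node)
            for nb in adjacency[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(sorted(component))
    return clusters

docs = ["stock market prices fall",
        "stock prices rise on market news",
        "football match ends in draw",
        "late goal wins football match"]
print(graph_clusters(docs, threshold=0.3))  # [[0, 1], [2, 3]]
```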
Probabilistic
Words are assigned probabilities of belonging to topics, and these probabilities are used to form the clusters.
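The sketch below shows only the assignment step of a probabilistic approach: given word probabilities per topic, Bayes' rule yields the probability that a document belongs to each topic. The probability tables are hypothetical and supplied by hand; a real topic model would estimate them from the corpus.

```python
import math

# Hypothetical P(word | topic) values, written by hand for illustration.
TOPIC_WORD_PROBS = {
    "sports":  {"goal": 0.40, "match": 0.40, "market": 0.10, "price": 0.10},
    "finance": {"goal": 0.05, "match": 0.05, "market": 0.50, "price": 0.40},
}
TOPIC_PRIOR = {"sports": 0.5, "finance": 0.5}  # hypothetical P(topic)

def topic_posterior(tokens):
    """Return P(topic | tokens), assuming words are independent given the topic."""
    log_scores = {}
    for topic, word_probs in TOPIC_WORD_PROBS.items():
        score = math.log(TOPIC_PRIOR[topic])
        for tok in tokens:
            # Small floor so unseen words do not produce log(0).
            score += math.log(word_probs.get(tok, 1e-6))
        log_scores[topic] = score
    # Normalise in log space for numerical stability.
    peak = max(log_scores.values())
    weights = {t: math.exp(s - peak) for t, s in log_scores.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

posterior = topic_posterior(["goal", "match", "goal"])
print(max(posterior, key=posterior.get))  # sports
```

Each document receives a probability for every topic, so it can be assigned to its most likely cluster or kept as a soft assignment across several.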