The very first version of the Lingo clustering engine we created is
available free of charge as part of the Open Source Carrot2 Framework.
While the commercial edition of Lingo builds on the experience we
gained from the Open Source engine, we decided to completely
rewrite the code in order to achieve superior clustering quality
and performance.
| Feature | Open Source edition | Commercial edition |
|---|---|---|
| Time of clustering*, 100 results | 0.34 s | 0.06 s |
| Time of clustering*, 200 results | 0.52 s | 0.10 s |
| Time of clustering*, 400 results | 0.84 s | 0.17 s |
| Time of clustering*, 5000 results | ---** | 1.70 s |
| Hierarchical clustering | no | yes |
| Customizable stop word list | yes | yes |
| Label filtering (suppressing specific words or phrases in the output cluster labels) | no | yes |
| Label boosting (promoting specific words or phrases in the output cluster labels) | no | yes |
| Synonyms (defining groups of words or phrases to be treated as synonymous) | no | yes |
| Document-to-cluster misassignment (ratio of documents in a cluster that are irrelevant to the cluster label) | medium | low |
| Number of tunable parameters | 2 | 55*** |
| Further development | Only critical bugfixes | New features planned |
*) Clustering speed was measured on 100, 200 and 400 snippets
downloaded from Yahoo! for the query 'london', using the
Lingo3G Tuning Browser application. Benchmark environment: Pentium
M 1.3 GHz, 768 MB RAM, Windows XP. Java Virtual Machine: Sun JDK
1.4.2, JVM switches: -Xmx512m -Xms128m -XX:NewRatio=1 -server. Each
time presented in the table is an average of 75 runs; for each
algorithm, the timed runs were preceded by 25 untimed warm-up runs.
**) The Open Source edition is not scalable enough to reliably cluster
very large numbers of documents.
***) The parameters can be used to tune, among other things, the
preferred number of clusters and hierarchy depth, preferred length
of cluster labels, desired number of unclustered documents,
document-to-cluster assignment precision and maximum cluster size.
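
The measurement procedure from the first footnote (25 untimed warm-up runs and an average over 75 timed runs) can be sketched in Java roughly as follows. This is only an illustration of the general technique, not the harness used for the figures above: the class `WarmupBenchmark`, the method `averageMillis` and the sorting workload are hypothetical stand-ins, no Lingo or Carrot2 API is called, and the code uses modern Java syntax rather than JDK 1.4.2. It assumes the warm-up runs come before the timed ones, as is usual on the JVM.

```java
import java.util.Arrays;
import java.util.Random;

public class WarmupBenchmark {

    /**
     * Runs the task warmupRuns times without timing it, then timedRuns
     * times with timing, and returns the mean duration in milliseconds.
     */
    public static double averageMillis(Runnable task, int warmupRuns, int timedRuns) {
        if (timedRuns <= 0) {
            throw new IllegalArgumentException("timedRuns must be positive");
        }
        // Untimed warm-up: gives the JIT compiler a chance to optimize hot code.
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        // Timed runs: accumulate elapsed nanoseconds across all runs.
        long totalNanos = 0;
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (double) timedRuns / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Hypothetical workload standing in for a clustering call.
        Runnable workload = () -> {
            int[] data = new Random(42).ints(100_000).toArray();
            Arrays.sort(data);
        };
        double avg = averageMillis(workload, 25, 75);
        System.out.printf("average over 75 timed runs: %.3f ms%n", avg);
    }
}
```

Separating warm-up from timed runs matters on a JVM because early iterations include class loading and interpreted execution, which would otherwise inflate the average.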