The very first version of the Lingo clustering engine we created is
available free of charge as part of the Open Source Carrot2 Framework.
While the commercial edition of Lingo builds on the experience we
gained from the Open Source engine, we decided to completely
rewrite the code in order to achieve superior clustering quality
and performance.
| Feature | Open Source edition | Commercial edition |
|---|---|---|
| Time of clustering [s]*: 100 results | 0.48 s | 0.01 s |
| Time of clustering [s]*: 200 results | 0.34 s | 0.03 s |
| Time of clustering [s]*: 400 results | 0.74 s | 0.05 s |
| Time of clustering [s]*: 10,000 results | ---** | 0.53 s |
| Hierarchical clustering | no | yes |
| Customizable stop word list | yes | yes |
| Label filtering (suppressing specific words or phrases in the output cluster labels) | yes | yes |
| Label boosting (promoting specific words or phrases in the output cluster labels) | no | yes |
| Synonyms (defining groups of words or phrases to be treated as synonymous) | no | yes |
| Document-to-cluster misassignment (ratio of documents in a cluster that are irrelevant to the cluster label) | medium | low |
| Results tuning | basic | advanced *** |
| Further development | Only critical bugfixes | New features planned |
*) Clustering speed was measured on 100, 200, and 400
snippets downloaded from Yahoo! for the query 'lucene', using the
Lingo3G Document Clustering Workbench. Benchmark environment: Intel
Core2 Duo E8400 3 GHz, 3 GB RAM, Windows XP. Java Virtual Machine:
Sun JDK 1.6.0, JVM switches: -server -Xmx512m. Each time in the table
is an average of 100 runs; for each algorithm, the timed runs were
preceded by 100 untimed warm-up runs.
**) The Open Source edition is not scalable enough to reliably cluster
very large numbers of documents.
***) The following aspects of clustering can be tuned:
preferred number of clusters and hierarchy depth, preferred length
of cluster labels, desired number of unclustered documents,
document-to-cluster assignment precision, maximum cluster size and
many more.
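
The measurement procedure described in the first footnote (untimed warm-up runs followed by timed runs whose durations are averaged) can be sketched in Java roughly as follows. This is only an illustration of the general technique, not the harness used to produce the numbers in the table. The class name `WarmupBenchmark`, the method `averageMillis`, and the placeholder workload are all hypothetical, and the workload does not call any Lingo or Carrot2 API.

```java
public class WarmupBenchmark {

    /**
     * Runs the task warmupRuns times without timing it, then runs it
     * timedRuns times and returns the mean wall-clock time in milliseconds.
     */
    public static double averageMillis(Runnable task, int warmupRuns, int timedRuns) {
        if (timedRuns <= 0) {
            throw new IllegalArgumentException("timedRuns must be positive");
        }
        // Untimed warm-up: gives the JVM a chance to JIT-compile the hot code.
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        // Timed runs: accumulate elapsed nanoseconds for each run.
        long totalNanos = 0L;
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (double) timedRuns / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Hypothetical placeholder workload standing in for a clustering call.
        Runnable placeholder = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i % 10);
            }
            if (sb.length() != 10_000) {
                throw new IllegalStateException("unexpected length");
            }
        };
        // 100 warm-up runs and 100 timed runs, matching the footnote.
        double avgMs = averageMillis(placeholder, 100, 100);
        System.out.printf("average: %.3f ms over 100 timed runs%n", avgMs);
    }
}
```

To approximate the footnote's setup, the class would be launched with the same JVM switches (`-server -Xmx512m`). Absolute timings depend on hardware and JVM version, so results from a sketch like this are only comparable within a single environment.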