Clusterer (jInfer API Specification)

cz.cuni.mff.ksi.jinfer.twostep.clustering
Interface Clusterer<T>

All Known Subinterfaces: ClustererWithAttributes<T,S>

All Known Implementing Classes: AttributeClusterer, Iname, SimpleDataClusterer

public interface Clusterer<T>

Interface for implementations of clustering algorithms. Implementors will probably use AbstractNode as the type parameter of Clusterer and distinguish the kinds of nodes submitted for clustering at runtime.

The purpose of clustering is to group elements into clusters according to some criterion, generally an identical name. Sometimes elements with the same name appear in documents with different semantics; sometimes misspelled element names cause semantically identical elements to have different names.

A Clusterer has to deal with these issues.

The method getRepresentantForItem is used to obtain one representative of an element, simple data node, or attribute when steps are added to the automaton (steps have to compare as A.equals(B) when nodes A and B are in the same cluster). The clusterer therefore has to process the right-hand sides of elements as well, perhaps simply by doing:

 for (Node x : queue) {
   // Only elements have a right-hand side whose tokens need clustering too.
   if (x.isElement()) {
     this.addAll(((Element) x).getSubnodes().getTokens());
   }
 }

When the automaton is created, getRepresentantForItem() is called for everything on the right-hand side of an element's rule. The Clusterer therefore has to handle SimpleData (one cluster for all simple data nodes). Attributes are omitted during automaton creation, so the clusterer can omit them too. Those who wish to write a simplifier that handles attributes should take a look at the ClustererWithAttributes interface.

Each item has to be in exactly one cluster (that's what clustering is all about).
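The rules above (one shared cluster for all simple data, every item in exactly one cluster) can be illustrated with a minimal sketch. SketchNode and NameKeyClustering are hypothetical names introduced only for this illustration and are not part of jInfer; grouping elements by case-insensitive name is likewise just an assumed criterion, not the one any jInfer clusterer is known to use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical node type for illustration only; jInfer's own node classes are richer.
final class SketchNode {
  final String name;
  final boolean simpleData;

  SketchNode(String name, boolean simpleData) {
    this.name = name;
    this.simpleData = simpleData;
  }
}

// Hypothetical helper, not a jInfer class.
final class NameKeyClustering {
  // Key shared by all simple data nodes. '#' cannot start an XML name,
  // so this key never collides with an element name.
  private static final String SIMPLE_DATA_KEY = "#simpledata";

  // Assumed criterion: elements are grouped by lower-cased name.
  static String keyOf(SketchNode node) {
    return node.simpleData ? SIMPLE_DATA_KEY : node.name.toLowerCase(Locale.ROOT);
  }

  // Each node is stored under exactly one key, so it ends up in exactly one cluster.
  static Map<String, List<SketchNode>> cluster(List<SketchNode> nodes) {
    Map<String, List<SketchNode>> clusters = new LinkedHashMap<>();
    for (SketchNode node : nodes) {
      clusters.computeIfAbsent(keyOf(node), k -> new ArrayList<>()).add(node);
    }
    return clusters;
  }
}
```

Because every node maps to a single key, the sizes of the resulting clusters always sum to the number of input nodes.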

Method Summary
`void`	`add(T item)` Add the item to the clusterer and enqueue it for processing.
`void`	`addAll(Collection<T> items)` Add the whole collection to the queue for clustering.
`void`	`cluster()` Do the main job: cluster the enqueued items into clusters.
`List<Cluster<T>>`	`getClusters()` Without doing clustering again, return the result of the last cluster() call.
`T`	`getRepresentantForItem(T item)` Return the representative of the item's cluster.

Method Detail

add

void add(T item)

Add the item to the clusterer and enqueue it for processing. Do not implement clustering here; it has to happen in the cluster() method to enable thread interruption.

Parameters: item - the item to add

addAll

void addAll(Collection<T> items)

Add the whole collection to the queue for clustering.

Parameters: items - the elements to add

cluster

void cluster()
             throws InterruptedException

Do the main job: cluster the enqueued items into clusters, but do not throw away old items. If some items are already in clusters, the user expects that they have not disappeared.

Example: after add(x), add(y), add(xx), add(yx), the enqueued items are x, y, xx, yx. Calling cluster() creates clusters, for example based on the starting letter, which yields two clusters: (x, xx) | (y, yx)

Now let the user call add(xd), add(zz). Calling cluster() again has to result in (x, xx, xd) | (y, yx) | (zz)

Of course, if the clustering criterion is not as stable as the first letter, items x, xx, y, yx may move to different clusters. The point is that they do not disappear: once an item is added, the clusterer has to hold it for future cluster() calls. The cluster() method has to check for thread interruption by using: if (Thread.interrupted()) { throw new InterruptedException(); } in some main loop.

Throws: InterruptedException - if the thread is interrupted while clustering
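The contract of cluster() can be sketched with the first-letter criterion from the example. LetterClusterer is a hypothetical class written for this illustration: it works on plain non-empty strings and returns lists of strings instead of jInfer's Cluster<T>, so it shows the expected behaviour but is not an implementation of this interface.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not a jInfer class. Assumes every item is a non-empty string.
final class LetterClusterer {
  // Every item ever added. Never cleared, so repeated cluster() calls see all of them.
  private final List<String> items = new ArrayList<>();
  // Result of the last cluster() call, keyed by first letter in order of first appearance.
  private Map<Character, List<String>> clusters = new LinkedHashMap<>();

  // Enqueue only; no clustering work happens here.
  void add(String item) {
    items.add(item);
  }

  void addAll(Collection<String> newItems) {
    items.addAll(newItems);
  }

  // Rebuilds the clusters from all items held so far, old and new.
  void cluster() throws InterruptedException {
    Map<Character, List<String>> result = new LinkedHashMap<>();
    for (String item : items) {
      // Interruption check inside the main loop, as the contract requires.
      if (Thread.interrupted()) {
        throw new InterruptedException();
      }
      result.computeIfAbsent(item.charAt(0), k -> new ArrayList<>()).add(item);
    }
    clusters = result;
  }

  // Returns a copy of the last result without clustering again.
  List<List<String>> getClusters() {
    return new ArrayList<>(clusters.values());
  }
}
```

With this sketch, adding x, y, xx, yx and calling cluster() gives (x, xx) | (y, yx); adding xd and zz and calling cluster() again gives (x, xx, xd) | (y, yx) | (zz). Rebuilding from the full item list on every call is the simplest way to guarantee that earlier items never disappear; an incremental implementation would be an equally valid design.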

getRepresentantForItem

T getRepresentantForItem(T item)

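Return the representative of the item's cluster.

Parameters: item - the item whose cluster representative is requested
Returns: the representative of the cluster that contains the item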
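One possible way to serve such lookups is an index from each item to its cluster's representative, built once after clustering. The following is a sketch under stated assumptions: RepresentativeIndex is a hypothetical class, the first member of each cluster is arbitrarily chosen as the representative, and throwing IllegalArgumentException for an unknown item is an assumed policy that this specification does not prescribe.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not a jInfer class.
final class RepresentativeIndex {
  private final Map<String, String> representativeOf = new HashMap<>();

  // Assumption: the first member of each cluster serves as its representative.
  RepresentativeIndex(List<List<String>> clusters) {
    for (List<String> cluster : clusters) {
      if (cluster.isEmpty()) {
        continue;
      }
      String representative = cluster.get(0);
      for (String member : cluster) {
        representativeOf.put(member, representative);
      }
    }
  }

  // Items from the same cluster yield equal representatives.
  String getRepresentantForItem(String item) {
    String representative = representativeOf.get(item);
    if (representative == null) {
      // Assumed policy for items that were never clustered.
      throw new IllegalArgumentException("Item was never clustered: " + item);
    }
    return representative;
  }
}
```

Since all members of a cluster map to the same representative, two items A and B from one cluster produce representatives that are equal under equals(), which is the property automaton construction relies on.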

getClusters

List<Cluster<T>> getClusters()

Without doing clustering again, return the result of the last cluster() call.

Returns: all clusters obtained by the last call of cluster()