Wednesday, September 5, 2012

Rough Set approach for document retrieval (part 1)


Setting up a search engine is an extremely easy task. I've seen a lot of production-level implementations built on top of Lucene or Solr with very little effort. But as time passes and the application gets bigger and bigger, the search results start to get a bit messy. And then people come up with lots of ideas for working their search results out. One of the most common home-made (and very simple) solutions I've seen so far is to gather keywords from the search results and feed them back to the search engine along with the original query. It's actually not a bad idea; one can find plenty of material on the subject in the literature under the term [query expansion].
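That feedback loop can be sketched in a few lines. Everything below is a made-up toy (the `search` function is a stand-in for a real engine like Lucene/Solr, which would return ranked hits), but it shows the shape of the idea: search, collect frequent terms from the results, append them to the query.

```python
from collections import Counter

def search(query, documents):
    # Toy search: return documents containing every query term.
    # A real system would query Lucene/Solr and use the top-ranked hits.
    terms = query.lower().split()
    return [d for d in documents if all(t in d.lower().split() for t in terms)]

def expand_query(query, documents, k=2):
    # Pseudo-relevance feedback: add the k most frequent new terms
    # found in the search results to the original query.
    hits = search(query, documents)
    counts = Counter(t for d in hits for t in d.lower().split())
    for t in query.lower().split():
        del counts[t]  # don't re-add the query's own terms
    extra = [t for t, _ in counts.most_common(k)]
    if not extra:
        return query
    return query + " " + " ".join(extra)

docs = [
    "rough set theory in information retrieval",
    "rough approximation of vague concepts",
    "lucene search engine internals",
]
print(expand_query("rough", docs, k=2))
```

A real implementation would of course weight candidate terms by more than raw frequency, which is exactly where the rough-set machinery below comes in.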

I've recently been working in a group where [Rough Sets] theory is very popular, so let's try to cast our simple QE idea in this formalism. With every term from the dataset we associate the set of terms that collocate with it. The upper approximation of this set consists of the terms whose first-order [collocation] with it is higher than a certain threshold. This approach is well described in [paper]. The authors derive an Extended Weighting Scheme from the standard TF-IDF weighting, making use of this higher-order collocation feature (U_R(d_i) is the union of the upper approximations of all terms t in document d_i):

    w_ij = (1 + log f_ij) * log(N / f_j)                              if t_j occurs in d_i
    w_ij = min_{t_h in d_i} w_ih * log(N / f_j) / (1 + log(N / f_j))  if t_j in U_R(d_i) but not in d_i
    w_ij = 0                                                          otherwise

(f_ij is the frequency of term t_j in document d_i, f_j its document frequency, and N the number of documents.)
The method was originally developed for document clustering, but the weighting scheme obviously follows our simple intuition for QE and moves it to indexing time. I've implemented this formula in Python using sparse matrices. The implementation is time/space efficient but not incremental (which means you need to "see" all the documents at indexing time):
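The function below expects a plain docs × terms frequency matrix. For completeness, one way to build such an input with scipy from (doc, term, count) triples (the counts here are made up):

```python
import numpy as np
from scipy import sparse

# toy (doc, term) -> frequency triples; 3 documents, 3 terms
rows = [0, 0, 1, 1, 2]
cols = [0, 1, 1, 2, 2]
vals = [2., 1., 1., 3., 1.]

# docs x terms frequency matrix in CSC format
A = sparse.csc_matrix((vals, (rows, cols)), shape=(3, 3))
print(A.toarray())
```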


import numpy as np
from math import log
from scipy import sparse

def sparse_trsm(A, fi = 10):
    # A: docs x terms frequency matrix; fi: co-occurrence threshold
    A = sparse.csc_matrix(A, dtype=float)
    E, M = [], float(A.shape[0])
    Acoo = A.tocoo(copy = True)  # keep the raw term frequencies before binarizing A
    A.data = np.ones(len(A.data))
    C = A.transpose().dot(A)     # term-term first-order co-occurrence counts
    A = A.tocsr()
    np.putmask(C.data, C.data < fi, 0)  # threshold the tolerance relation
    C.eliminate_zeros()
    C = C.tocsr()  # row slicing below relies on CSR, so .indices are column ids

    for i in range(A.shape[0]):
        words = A.indices[A.indptr[i]: A.indptr[i + 1]]
        if len(words) > 0:
            # upper approximation: terms related to the document's terms,
            # minus the terms the document already contains
            U = np.setdiff1d(C[words, :].indices, words)
        else:
            U = []
        E.append(U)

    f_D = A.sum(axis = 0).tolist()[0]  # document frequencies (A is binary here)
    idf = [log(M / ti) if ti else 0. for ti in f_D]
    idfs2 = np.array([i / (1. + i) for i in idf])  # idf / (1 + idf) factor
    data = tuple((1 + log(v)) * idf[t] for (t, v) in zip(Acoo.col, Acoo.data))
    A = sparse.coo_matrix((data, (Acoo.row, Acoo.col)), shape = A.shape).tocsr()

    cols = [i for s in E for i in s]
    rows = [j for s in [[i] * l for (i, l) in enumerate(map(len, E))] for j in s]
    data = []
    append = data.append

    # minimal TF-IDF weight in each document, used to weight the extension terms
    mintfidf = [min(A.data[A.indptr[i]: A.indptr[i + 1]])
                   if A.indptr[i] != A.indptr[i + 1] else 0. for i in range(int(M))]
    for i, j in zip(rows, cols):
        append(mintfidf[i] * idfs2[j])
    W = sparse.coo_matrix((data, (rows, cols)), shape = A.shape)
    return A + W, W
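As a sanity check on the weighting scheme, the same computation can be done densely for a tiny corpus. Everything here is made up for illustration (a 3 × 4 frequency matrix and a threshold of 1, not the Reuters data); the point is just that extension terms get a weight strictly below the smallest TF-IDF weight already in the document:

```python
import numpy as np

# toy docs x terms frequency matrix (3 docs, 4 terms); purely illustrative
F = np.array([[2., 1., 0., 0.],
              [0., 1., 3., 0.],
              [0., 0., 1., 1.]])
N = F.shape[0]
theta = 1  # co-occurrence threshold for the tolerance relation

B = (F > 0).astype(float)   # binary incidence matrix
C = B.T @ B                 # term-term first-order co-occurrence counts
df = B.sum(axis=0)          # document frequencies
idf = np.log(N / df)

# standard TF-IDF weights for terms actually present in each document
W = np.zeros_like(F)
present = F > 0
W[present] = (1 + np.log(F[present])) * idf[np.nonzero(present)[1]]

for i in range(N):
    terms = np.nonzero(present[i])[0]
    # upper approximation: terms co-occurring (>= theta) with the doc's terms
    related = np.nonzero((C[terms] >= theta).any(axis=0))[0]
    extension = np.setdiff1d(related, terms)
    w_min = W[i, terms].min()
    for j in extension:
        # extension terms are damped: idf / (1 + idf) < 1
        W[i, j] = w_min * idf[j] / (1 + idf[j])

print(np.round(W, 3))
```

The extended weights fill in a few off-document cells while keeping them smaller than any in-document weight, which is exactly the behavior the sparse implementation above scales up.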

Well, it seems that Python isn't that slow: indexing 10,000 documents from the Reuters dataset took 160 s.
But here comes the bummer: the number of changes depends on the size of the document collection.
For the first 1000 docs, the number of nonzero elements in the output doc-term matrix is 140 times larger. It's just that thresholding a feature like first-order collocation is too simple; I need to come up with something else. I'm thinking that something like a "second-order TF-IDF" for n-grams would be nice here.

Art & Science Festival

Recently I had the opportunity to coordinate an event in our department, part of the [art&science festival] at our university. With a bunch of students we put together some exhibitions loosely connected with human-computer interaction. It was a lot of fun! We used all of those things... you know: Kinect, Wii, cameras, infrared LEDs, augmented reality, Arduino, etc. (see selected videos below). It isn't much, but we were able to mount all of this in just one week, starting almost from scratch.



 



During the event I met an older gentleman (an engineer, I suppose) who was amazed at how fast we had arranged so much stuff. Well... today we just have better tools. Open-source and open-hardware just kick ass; it's an industrial revolution. Just watch this TED talk to learn more.


Sunday, February 26, 2012

Interactive Table

Recently, together with [andrut], we made an interactive table. The table was used for the closing exhibition of a cultural-educational project, [dzwiękospacery].



It was made with [Processing] and a little bit of OpenCV.