--- Log opened Sat Dec 17 00:00:19 2011
05:56 -!- puneetgoyal [~puneetgoy@] has joined #shogun
06:38 -!- Ram108 [~amma@] has joined #shogun
10:30 -!- puneetgoyal [~puneetgoy@] has quit [Quit: Leaving]
10:30 -!- puneetgoyal [~puneetgoy@] has joined #shogun
11:07 -!- blackburn [~blackburn@] has joined #shogun
11:36 <@sonney2k> blackburn, I think we should try to get this fast k-means code into shogun
11:36 <@sonney2k> shouldn't be too difficult so if someone asks for what to do...
11:36 <@sonney2k> also, porting birch (java) to c++ would make sense
11:37 <blackburn> sonney2k: hmm sure
11:37 <@sonney2k> and I would propose some new branch of algorithms for sampling from data, i.e. getting a uniform subset
11:37 <@sonney2k> or some other more complicated ways
11:38 <blackburn> you are nipsted
11:38 <@sonney2k> there is more
11:38 <@sonney2k> someone here had a fast cross-validation scheme
11:38 <@sonney2k> that would be very cool stuff for heiko
11:38 <blackburn> sorry but I'm really out of time till new year
11:39 <blackburn> fast CV scheme? how can it be fast?
11:39 <@sonney2k> btw, we have now shogun in debian unstable again...
11:39 <blackburn> did you package it?
11:39 <@sonney2k> it does CV on small subsets of the data first
11:40 <@sonney2k> and this way can throw away lots of combinations to be tested. then it increases these subsets slowly - speedups of 2 orders of magnitude without loss are possible
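Editor's sketch of the idea described above: score every parameter combination on a small subset, discard the weaker ones, then grow the subset and repeat. The scheme mentioned in the log is not named, so this is a generic pruning loop, not that method. The function name, the halving rule (`keep`), the doubling of the subset and the `score(candidate, indices)` callback are all assumptions, and this simple version carries no "without loss" guarantee.

```python
import random

def prune_by_growing_subsets(candidates, n_samples, score, start=50, keep=0.5, seed=0):
    """Pick a candidate by scoring on subsets that grow from `start` to `n_samples`.

    `score(candidate, indices)` is a hypothetical callback that returns a
    validation score (higher is better) computed only on the given indices.
    """
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)                      # fixed random order, so subsets are nested
    survivors = list(candidates)
    size = min(start, n_samples)
    while True:
        subset = order[:size]
        ranked = sorted(survivors, key=lambda c: score(c, subset), reverse=True)
        if size >= n_samples or len(ranked) == 1:
            return ranked[0]                # full data reached or one candidate left
        # Assumed pruning rule: keep the best fraction, then double the subset.
        survivors = ranked[:max(1, int(len(ranked) * keep))]
        size = min(2 * size, n_samples)
```

Most candidates are only ever scored on the small early subsets, which is where the saving comes from.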
11:42 <blackburn> sonney2k: I now need some binary tree classifier that makes it possible to use, say, LDA for multiclass
11:42 <blackburn> what do you think, is it useful?
11:42 <@sonney2k> I am currently messing with python -builtin
11:42 <blackburn> what is builtin?
11:43 <@sonney2k> no more .py files but direct py objects
11:43 <@sonney2k> an option for swig
11:43 <@sonney2k> blackburn, you are talking about a certain error correcting codes scheme right?
11:43 <blackburn> sonney2k: yes, it is a variant of binary tree classifier
11:44 <@sonney2k> to convert any binary classifier into a multiclass classifier?
11:44 <@sonney2k> there was another very nice algorithm at nips for that
11:44 <@sonney2k> some boosting with minimal set of features - massively many classes - but very fast and accurate
11:45 <@sonney2k> ahh and a binary tree thingy too - that sounded like very reasonable work - let me check if I find it
11:47 <blackburn> I need something like that for my road signs work
11:47 <@sonney2k> how many classes do you have?
11:47 <blackburn> now 43, but will have many more
11:48 <blackburn> some of them should be grouped for sure
11:48 <blackburn> like red signs, blue signs, etc
11:48 <@sonney2k> one is ShareBoost: Efficient multiclass learning with feature sharing
11:48 <@sonney2k> S. Shalev-Shwartz, Y. Wexler, A. Shashua
11:49 <blackburn> got it thanks
<@sonney2k> the other is
11:50 <@sonney2k> this is probably more what you could directly use
11:50 <@sonney2k> it infers the tree and learns the SVMs / LDA whatever
11:50 <@sonney2k> seemed very reasonable
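Editor's sketch of the general "tree of binary classifiers" idea discussed above: each inner node splits the remaining classes into two groups and trains one binary classifier to choose a side. This is not the algorithm from the paper mentioned in the log. The grouping rule (order classes by mean input and cut the list in half), the toy 1-D `CentroidBinary` classifier and all names are assumptions made for illustration; any binary learner with `fit`/`predict` could be plugged in.

```python
from statistics import mean

class CentroidBinary:
    """Toy binary classifier on 1-D inputs (hypothetical stand-in for SVM/LDA)."""
    def fit(self, xs, sides):
        self.left = mean(x for x, s in zip(xs, sides) if s == 0)
        self.right = mean(x for x, s in zip(xs, sides) if s == 1)
        return self

    def predict(self, x):
        # Side 0 if x is at least as close to the left mean, else side 1.
        return 0 if abs(x - self.left) <= abs(x - self.right) else 1

def build_tree(xs, ys, classes, make_binary=CentroidBinary):
    """Return a class label (leaf) or a tuple (classifier, left_subtree, right_subtree)."""
    if len(classes) == 1:
        return classes[0]
    # Assumed grouping rule: sort classes by their mean input, split in the middle.
    ordered = sorted(classes, key=lambda c: mean(x for x, y in zip(xs, ys) if y == c))
    half = len(ordered) // 2
    left, right = ordered[:half], ordered[half:]
    # Train this node only on examples whose class is still under it.
    pairs = [(x, y) for x, y in zip(xs, ys) if y in classes]
    clf = make_binary().fit([x for x, _ in pairs],
                            [0 if y in left else 1 for _, y in pairs])
    return (clf,
            build_tree(xs, ys, left, make_binary),
            build_tree(xs, ys, right, make_binary))

def predict(tree, x):
    """Walk from the root to a leaf; labels are assumed not to be tuples."""
    while isinstance(tree, tuple):
        clf, left, right = tree
        tree = left if clf.predict(x) == 0 else right
    return tree
```

With K classes a prediction evaluates about log2(K) classifiers instead of K, and the grouping step is where domain knowledge such as "red signs" versus "blue signs" could enter.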
11:51 <blackburn> yes, I had a similar idea before I found out it was already done :D
11:51 <@sonney2k> blackburn, ohh yes please add pointers to all of that in the bts :-)
11:51 <@sonney2k> these are all very worthwhile ideas for gsoc
11:51 <@sonney2k> bug tracking system
11:51 <@sonney2k> aka github issues
11:51 <blackburn> ah yes I'll add it
<@sonney2k> blackburn, that is the thing with the sampling
11:52 <@sonney2k> they use it to do fast GMM estimation
11:52 <@sonney2k> i.e. they sample from the data and then on that sample estimate the GMM
11:52 <@sonney2k> but the sampling is very clever...
11:53 <blackburn> pretty simple idea heh
11:54 <blackburn> well and my point is that all the things should be as simple as they can be
11:58 <@sonney2k> yeah - and easy to implement... we just need some new class of 'sampler' algorithms
11:58 <@sonney2k> they get as input some data and then return an index with a subset :)
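Editor's sketch of the 'sampler' interface described above: take some data, return the indices of a subset. Only the simplest case (a uniform subset without replacement) is shown; the "very clever" sampling mentioned in the log is not reproduced here. The class name `UniformSampler` and its method names are hypothetical and do not refer to any shogun class.

```python
import random

class UniformSampler:
    """Return indices of a uniformly chosen subset of the data (hypothetical API)."""

    def __init__(self, subset_size, seed=None):
        self.subset_size = subset_size
        self._rng = random.Random(seed)

    def sample(self, data):
        n = len(data)
        k = min(self.subset_size, n)       # never ask for more items than exist
        # Sampling without replacement; sorted so the index is easy to reuse.
        return sorted(self._rng.sample(range(n), k))
```

Returning an index rather than a copy keeps the sampler independent of the feature type, so other strategies could share the same `sample(data)` signature.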
12:00 <blackburn> sonney2k: TreeMulticlassMachine?
12:01 <@sonney2k> no I think this should be under classifier
12:01 <@sonney2k> MulticlassTreeClassifier ?
12:05 <blackburn> why classifier?
12:05 <blackburn> ah yes
<15SAAI18M> shogun: Sergey Lisitsyn master * rce5c547 / src/shogun/converter/DiffusionMaps.h : Mentioned paper on DiffusionMasp -
12:12 <blackburn> sonney2k: is your TU mail still available for you?
12:12 <@sonney2k> better use my address
12:12 <@sonney2k> it is but not for long I think
12:13 <blackburn> sonney2k: I'm asking cause Ori Cohen wrote to both of us, have you seen?
12:13 <blackburn> it is a guy who was working on C# examples too
12:14 <@sonney2k> yes I have seen - I don't know what to do about it
12:14 <@sonney2k> I thought you added him to NEWS?
12:15 <blackburn> sonney2k: yes, he said it is ok for him
12:15 <@sonney2k> btw we have to update the website right to point to github issues
12:15 <@sonney2k> what is the url btw?
12:16 <@sonney2k> yes that one, let me do the update
12:16 <@sonney2k> then I can also include Ori in the NEWS on the site
12:17 <blackburn> yes please do then
12:19 <@sonney2k> btw, we got our windows7 buildbot
12:19 <@sonney2k> I just didn't have time to administer it
12:20 <blackburn> heh nice
12:36 -!- puneetgoyal [~puneetgoy@] has quit [Ping timeout: 240 seconds]
12:48 -!- puneetgoyal [~puneetgoy@] has joined #shogun
12:49 <@sonney2k> blackburn, just one thought - should we add some print / string function to show a compact output of shogun objects?
12:49 <@sonney2k> e.g. they could show their name and list parameters this way?
12:50 <blackburn> sonney2k: well why not
12:50 <blackburn> sonney2k: not the crucial thing though, no idea how to use it
12:51 <@sonney2k> well you could just do
12:51 <@sonney2k> print x
12:51 <@sonney2k> and then it will not say
12:51 <@sonney2k> <Swig Object of type 'shogun::CGaussianKernel *' at 0x7fe290d0db90>
12:52 <@sonney2k> GaussianKernel - Parameters width=1
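Editor's sketch of the proposal above in plain Python: an object whose string form shows its name and parameters instead of the default pointer-style text. The class below is a hypothetical stand-in, not the SWIG-wrapped shogun class, and the `parameters()` helper is an assumption about how the parameter list might be exposed.

```python
class GaussianKernel:
    """Hypothetical stand-in for a wrapped object; not the real shogun class."""

    def __init__(self, width=1):
        self.width = width

    def parameters(self):
        # Assumed helper: the parameters worth showing, by name.
        return {"width": self.width}

    def __str__(self):
        params = ", ".join(f"{name}={value}" for name, value in self.parameters().items())
        return f"{type(self).__name__} - Parameters {params}"


x = GaussianKernel(width=1)
print(x)  # uses __str__ rather than the default object representation
```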
12:52 <blackburn> sure, I understand
12:52 <@sonney2k> for features it could show the same kind of summary we have for numpy arrays
12:53 <@sonney2k> rather useful I would say
13:04 <puneetgoyal> hey :), why do we generally use this gaussian kernel ?
13:05 <puneetgoyal> In most examples I have seen only this kernel
13:07 <blackburn> puneetgoyal: it has some nice features
13:07 <blackburn> like 'virtual' infinite-dimensional Hilbert space mapping hah :)
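Editor's sketch of a Gaussian kernel evaluated directly on two vectors: the kernel value acts as an inner product in the implicit feature space, so the mapping itself is never computed. The parametrization `exp(-||x - y||^2 / width)` is an assumption here; libraries differ on how the width enters (for example `2*sigma^2` or `gamma`), so check the one you use.

```python
import math

def gaussian_kernel(x, y, width=1.0):
    """Gaussian (RBF) kernel under the assumed form exp(-||x - y||^2 / width)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / width)
```

Identical points give 1.0 and the value decays towards 0 as the points move apart, with `width` controlling how fast.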
13:36 -!- naywhayare [~ryan@] has joined #shogun
17:29 -!- Ram108 [~amma@] has quit [Remote host closed the connection]
18:27 <puneetgoyal> hello, I was trying to tokenize emails and am now fairly close... I wanted to know if there is a way to parse all the files containing the email data and store them in a matrix?
18:28 <blackburn> hmm how?
18:30 <puneetgoyal> how I made tokens from an email... or how I want to parse all the files?
18:31 <blackburn> how can you store email data in a matrix?..
18:31 <puneetgoyal> to tokenize emails... I used the email package and its various modules
18:32 <puneetgoyal> using that email package of python I can extract various information from an email that will be used to calculate the probability of the email being spam or ham
18:32 <puneetgoyal> and store them in a matrix
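Editor's sketch of tokenizing one message with Python's standard `email` package, as described above. Which fields to use (here only the subject and plain-text body parts) and the token pattern are assumptions; the function name `tokenize_email` is hypothetical.

```python
import email
import re

def tokenize_email(raw_text):
    """Return lowercase word tokens from the subject and plain-text parts."""
    msg = email.message_from_string(raw_text)
    parts = [msg.get("Subject", "")]
    for part in msg.walk():                      # also visits multipart containers
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)   # bytes, or None for containers
            if payload is not None:
                charset = part.get_content_charset() or "utf-8"
                parts.append(payload.decode(charset, errors="replace"))
    # Assumed token pattern: runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", " ".join(parts).lower())
```

Running this over every file gives one token list per mail, which is a list of lists rather than a matrix.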
18:34 <blackburn> is it a matrix of probabilities?
18:35 <puneetgoyal> no, I guess probabilities will be calculated after I train my system using some emails ?
18:36 <blackburn> so it is a token matrix?
18:37 <blackburn> how do you plan to use it?
18:37 <puneetgoyal> should I use some other method for training?
18:38 <blackburn> you may feel free to use any but I have doubts
18:38 <blackburn> cause stringfeatures in shogun supports just a list of strings
18:38 <blackburn> but not a list of lists of strings
18:41 <puneetgoyal> ok, so from where should I proceed?
18:42 <blackburn> puneetgoyal: I would suggest you compute a similarity measure with some technique you write yourself
18:42 <blackburn> and then form a similarity matrix to train an SVM or so
18:53 <blackburn> puneetgoyal: for example you can count identical tokens
18:54 <puneetgoyal> blackburn: sry, forgot to reply... I was reading about similarity measures
18:54 <puneetgoyal> identical tokens?
18:55 <blackburn> puneetgoyal: ['this','is','spam'] is 1.0 to ['this','is','spam'], but 0.6667 to ['this','is','sparta']
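Editor's sketch of one similarity that reproduces the two numbers above: count the tokens the two lists share and divide by the length of the longer list. The exact formula was not stated in the log, so this normalization is an assumption (a Jaccard index, for instance, would give 0.5 for the second pair).

```python
from collections import Counter

def token_similarity(a, b):
    """Shared-token count divided by the longer list's length (assumed formula)."""
    if not a and not b:
        return 1.0                                   # two empty mails count as identical
    shared = sum((Counter(a) & Counter(b)).values()) # multiset intersection size
    return shared / max(len(a), len(b))


print(round(token_similarity(['this', 'is', 'spam'], ['this', 'is', 'spam']), 4))    # 1.0
print(round(token_similarity(['this', 'is', 'spam'], ['this', 'is', 'sparta']), 4))  # 0.6667
```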
18:57 <puneetgoyal> blackburn: yes, but while testing right? I mean the list I will compare the mail with... would be made after training
18:57 <blackburn> not sure I understood you
18:58 <puneetgoyal> I mean suppose the first list you gave is a mail you want to check, and the second is the list you already have... that you know is spam or ham
18:58 <puneetgoyal> but to get the second list, you will first have to get some training data
18:59 <blackburn> well just get some training mails, determine their status
18:59 <blackburn> and form a matrix containing the similarity between the i-th and j-th mails
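Editor's sketch of the matrix described above: entry (i, j) holds the similarity between the i-th and j-th training mails, each given as a token list. The shared-token similarity used here is an assumption, and it is not guaranteed to be positive semi-definite, which some SVM solvers expect from a precomputed kernel. All names are hypothetical.

```python
from collections import Counter

def token_similarity(a, b):
    """Shared-token count divided by the longer list's length (assumed measure)."""
    if not a and not b:
        return 1.0
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / max(len(a), len(b))

def similarity_matrix(mails):
    """Build the symmetric n x n matrix K with K[i][j] = similarity(mail i, mail j)."""
    n = len(mails)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):                # compute the upper triangle only
            K[i][j] = K[j][i] = token_similarity(mails[i], mails[j])
    return K
```

The labels (spam or ham) stay in a separate list aligned with the rows; at test time a new mail is compared against every training mail to get one row of similarities.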
19:00 <puneetgoyal> ok, and I would have to write the respective weights against each of the keywords
19:02 <puneetgoyal> ok, I will make a module to construct this matrix asap
19:09 <blackburn> puneetgoyal: hey why are you in a hurry?
19:10 <puneetgoyal> blackburn: I don't have anything else to do, so I can spend the whole time on this :D
19:14 -!- ishaanmlhtr [~ishaan@] has joined #shogun
19:31 -!- ishaanmlhtr [~ishaan@] has quit [Ping timeout: 240 seconds]
20:06 -!- ishaanmlhtr [~ishaan@] has joined #shogun
22:29 -!- ishaanmlhtr [~ishaan@] has quit [Ping timeout: 240 seconds]
22:31 -!- ishaanmlhtr [~ishaan@] has joined #shogun
22:42 -!- ishaanmlhtr [~ishaan@] has quit [Ping timeout: 244 seconds]
22:44 -!- ishaanmlhtr [~ishaan@] has joined #shogun
23:16 -!- puneetgoyal [~puneetgoy@] has quit [Quit: Leaving]
--- Log closed Sun Dec 18 00:00:19 2011