Input data should be formatted as a data frame, with one row for each observation and one column for each feature. The first column should contain the true class labels. Factor features (e.g. gender, site, etc.) should be coded as multiple numeric “dummy” features (see `?model.matrix` for how to generate these automatically in R).
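For instance, here is a minimal sketch of dummy coding a hypothetical factor feature with `model.matrix`:

# Hypothetical toy data: one factor feature to be dummy coded
df <- data.frame(site = factor(c("NYU", "OHSU", "NYU", "Peking")))

# model.matrix() expands the factor into numeric dummy columns;
# dropping the intercept column keeps just the dummies
model.matrix(~ site, df)[, -1]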
An example dataset (derived from a subset of the ADHD200 sample) is included:
set.seed(12345)
library(e1071)
library(mRFE)
data(input)
dim(input)
#> [1] 208 547
input[1:5,1:5]
#> DX2 Full4.IQ Age bankssts_SurfArea caudalanteriorcingulate_SurfArea
#> 1000804 TD 109 7.29 1041 647
#> 1023964 ADHD 123 8.29 1093 563
#> 1057962 ADHD 129 8.78 1502 738
#> 1099481 ADHD 116 8.04 826 475
#> 1127915 TD 124 12.44 1185 846
There are 208 observations (individual subjects) and 546 features. The first column (`DX2`) contains the class labels (`TD`: typically-developing control; `ADHD`: attention deficit hyperactivity disorder).
To perform the feature ranking, use the `svmRFE` function:
res <- svmRFE(input, k = 10, halve.above = 100)
Here we’ve indicated that we want `k = 10` for the k-fold cross validation that forms the “multiple” part of mSVM-RFE. To use standard SVM-RFE instead, set `k = 1`. Also notice the `halve.above` parameter. This allows you to cut the feature set in half each round, instead of eliminating features one by one, which is very useful for data sets with many features. Here we’ve set `halve.above = 100`, so the feature set will be halved each round until fewer than 100 features remain. The output is a vector of feature indices, ordered from most to least “useful”.
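For instance, to map the top-ranked indices back to their column names (remembering that feature index i corresponds to column i + 1 of `input`, because of the class label column), something like:

# Names of the 10 top-ranked features (hypothetical follow-on)
colnames(input)[res[1:10] + 1]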
Note that because of the randomness of the CV fold assignments, features with similar ranking scores, and the possible inclusion of uninformative features, these rankings can vary somewhat from run to run. However, your output for this demo should be identical to that shown here, because we’ve all reset the random seed to the same value.
When exploring machine learning options, it is often useful to estimate generalization error and use this as a benchmark. However, it is important to remember that the feature selection step must be repeated from scratch on each training set in whatever cross validation or similar resampling scheme is chosen. When feature selection is performed on a data set with many features, it will pick some truly useful features that will generalize, but it will also likely pick some useless features that, by mere chance, happened to align closely with the class labels of the training set. While including these features will give (spuriously) good performance if the error is estimated from this same training set, the estimated performance will drop to its true value when the classifier is applied to a genuinely independent test set, where these features are useless. Guyon et al. actually made this mistake in the example demos in their original SVM-RFE paper. This issue is outlined very nicely in “Selection bias in gene extraction on the basis of microarray gene-expression data” (Ambroise & McLachlan, 2002).
Basically, the way to go is to wrap the entire feature selection and generalization error estimation process in a top-level loop of external cross validation. For 10-fold CV, we begin by defining which observations are in which folds.
nfold <- 10
nrows <- nrow(input)
folds <- rep(1:nfold, length.out = nrows)[sample(nrows)]
folds
#> [1] 5 6 3 8 1 6 9 6 6 5 4 3 6 3 4 2 1 7 2 10 6 3 5 1 6
#> [26] 10 9 10 10 1 8 7 10 9 7 1 7 6 4 4 8 10 6 3 7 9 10 3 1 5
#> [51] 9 4 3 10 10 8 4 3 1 9 6 6 3 8 2 5 5 7 5 9 7 7 8 9 2
#> [76] 8 4 1 1 10 2 9 6 9 7 6 3 4 10 5 1 2 6 9 6 10 1 7 1 5
#> [101] 9 9 4 8 1 2 9 2 5 4 4 3 8 8 5 4 4 1 2 2 7 2 6 8 4
#> [126] 7 2 3 8 1 6 5 3 10 8 4 10 9 2 8 10 1 10 9 7 2 1 3 2 3
#> [151] 9 1 8 9 2 1 6 7 5 4 7 2 7 6 8 3 2 5 10 2 5 4 8 7 7
#> [176] 3 2 5 1 9 5 10 9 4 8 8 3 5 7 3 3 4 5 8 2 5 4 6 6 10
#> [201] 7 7 10 3 4 8 5 1
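As a quick sanity check, the folds should be nearly equal in size:

table(folds)
#> folds
#>  1  2  3  4  5  6  7  8  9 10
#> 21 21 21 21 21 21 21 21 20 20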
In R, many parallel functions like to operate on lists, so next we’ll reformat `folds` into a list, where each list element is a vector containing the test set indices for that fold.
folds <- split(seq_along(folds), folds)
folds
#> $`1`
#> [1] 5 17 24 30 36 49 59 78 79 91 97 99 105 118 130 142 147 152 156
#> [20] 179 208
#>
#> $`2`
#> [1] 16 19 65 75 81 92 106 108 119 120 122 127 139 146 149 155 162 167 170
#> [20] 177 195
#>
#> $`3`
#> [1] 3 12 14 22 44 48 53 58 63 87 112 128 133 148 150 166 176 187 190
#> [20] 191 204
#>
#> $`4`
#> [1] 11 15 39 40 52 57 77 88 103 110 111 116 117 125 136 160 172 184 192
#> [20] 197 205
#>
#> $`5`
#> [1] 1 10 23 50 66 67 69 90 100 109 115 132 159 168 171 178 181 188 193
#> [20] 196 207
#>
#> $`6`
#> [1] 2 6 8 9 13 21 25 38 43 61 62 83 86 93 95 123 131 157 164
#> [20] 198 199
#>
#> $`7`
#> [1] 18 32 35 37 45 68 71 72 85 98 121 126 145 158 161 163 174 175 189
#> [20] 201 202
#>
#> $`8`
#> [1] 4 31 41 56 64 73 76 104 113 114 124 129 135 140 153 165 173 185 186
#> [20] 194 206
#>
#> $`9`
#> [1] 7 27 34 46 51 60 70 74 82 84 94 101 102 107 138 144 151 154 180
#> [20] 183
#>
#> $`10`
#> [1] 20 26 28 29 33 42 47 54 55 80 89 96 134 137 141 143 169 182 200
#> [20] 203
Using `lapply`, or one of its generic parallel cousins (e.g. `sge.parLapply` from the Rsge package), we can now perform the feature ranking for all 10 training sets.
results <- lapply(folds, svmRFE.wrap, input, k = length(folds), halve.above = 100)
length(results)
#> [1] 10
head(results)
#> $`1`
#> $`1`$feature.ids
#> [1] 337 201 59 492 503 1 128 175 173 273 304 286 33 45 446 460 402 60
#> [19] 147 497 504 359 177 68 303 77 518 453 257 418 516 248 368 498 262 20
#> [37] 409 41 400 372 144 269 383 362 105 517 345 85 468 415 103 512 255 491
#> [55] 376 434 214 388 195 389 408 486 521 392 131 459 36 540 186 126 89 412
#> [73] 385 47 542 61 399 187 397 319 505 124 289 407 309 236 282 511 120 171
#> [91] 502 313 284 27 5 73 219 91 52 84 245 536 463 435 95 395 110 203
#> [109] 539 50 451 129 21 423 113 51 355 17 482 335 373 9 22 380 109 390
#> [127] 104 367 324 441 465 64 256 321 310 298 290 487 526 218 538 108 416 49
#> [145] 414 295 537 172 15 160 83 433 381 111 311 180 473 315 483 464 196 417
#> [163] 58 90 200 527 485 63 327 101 283 405 222 452 191 384 232 190 43 371
#> [181] 420 481 343 365 406 166 217 476 263 354 125 92 242 281 364 450 479 308
#> [199] 462 57 334 339 130 396 16 132 312 291 253 535 370 440 152 29 100 155
#> [217] 411 146 314 349 238 8 522 198 508 107 520 221 211 143 489 136 268 533
#> [235] 161 382 261 292 404 532 97 496 249 325 466 133 37 472 427 475 353 361
#> [253] 193 506 178 2 260 544 461 543 436 93 494 437 534 297 293 116 80 357
#> [271] 86 270 210 34 70 65 223 271 378 74 118 474 530 356 106 529 425 252
#> [289] 10 285 72 46 447 422 169 280 488 495 305 391 330 209 88 123 332 541
#> [307] 274 26 401 300 183 117 40 112 188 470 430 31 151 197 56 168 142 338
#> [325] 287 212 265 510 277 394 331 307 98 163 328 42 350 445 288 62 369 99
#> [343] 12 7 454 363 239 227 149 30 140 115 162 233 366 231 342 360 182 206
#> [361] 438 266 170 53 153 165 135 75 351 228 329 299 316 326 500 323 215 322
#> [379] 156 226 469 514 499 35 276 379 439 348 254 79 341 66 150 344 174 259
#> [397] 32 480 148 225 202 431 301 54 55 134 19 243 306 347 403 237 234 127
#> [415] 455 192 94 158 121 240 375 114 296 410 44 235 513 69 244 320 216 507
#> [433] 48 484 184 477 141 194 428 456 515 278 137 204 23 119 241 398 154 251
#> [451] 159 387 457 413 102 87 449 176 185 467 374 531 24 442 25 246 250 519
#> [469] 336 13 358 81 122 179 229 317 340 3 82 71 18 167 501 207 258 424
#> [487] 275 11 443 164 352 38 444 523 96 448 264 524 458 39 208 189 224 67
#> [505] 199 478 157 220 318 6 386 279 230 333 346 471 139 294 205 302 426 377
#> [523] 28 545 14 181 4 76 138 528 525 78 421 213 145 546 432 419 509 247
#> [541] 493 267 429 490 272 393
#>
#> $`1`$train.data.ids
#> [1] "1000804" "1023964" "1057962" "1099481" "1187766" "1208795" "1283494"
#> [8] "1320247" "1359325" "1435954" "1471736" "1497055" "1511464" "1517240"
#> [15] "1567356" "1737393" "1740607" "1780174" "1854959" "1875084" "1884448"
#> [22] "1934623" "1992284" "1995121" "2030383" "2054438" "2136051" "2230510"
#> [29] "2260910" "2306976" "2497695" "2682736" "2730704" "2735617" "2741068"
#> [36] "2773205" "2821683" "2854839" "2907383" "2950672" "2983819" "2991307"
#> [43] "2996531" "3163200" "3174224" "3235580" "3243657" "3349205" "3349423"
#> [50] "3433846" "3441455" "3457975" "3542588" "3601861" "3619797" "3650634"
#> [57] "3653737" "3679455" "3845761" "3999344" "4060823" "4079254" "4084645"
#> [64] "4095229" "4116166" "4154672" "4164316" "4187857" "4562206" "4827048"
#> [71] "6206397" "6568351" "8009688" "8415034" "8692452" "8697774" "8834383"
#> [78] "9326955" "9578663" "9750701" "9907452" "10004" "10016" "10009"
#> [85] "10010" "10012" "10020" "10022" "10023" "10028" "10030"
#> [92] "10038" "10031" "10032" "10035" "10037" "10039" "10042"
#> [99] "10044" "10050" "10051" "10052" "10053" "10056" "10058"
#> [106] "10059" "10054" "10048" "10060" "10047" "10062" "10064"
#> [113] "10065" "10066" "10068" "10119" "10109" "10110" "10111"
#> [120] "10112" "10107" "10019" "10077" "10078" "10079" "10080"
#> [127] "10014" "10089" "10090" "10091" "10045" "10093" "10094"
#> [134] "10095" "10097" "10098" "10099" "10101" "10102" "10108"
#> [141] "10002" "10011" "10113" "10114" "10115" "10006" "10105"
#> [148] "10024" "10106" "10116" "10117" "10118" "10120" "10026"
#> [155] "10121" "10122" "10123" "10082" "10074" "10076" "10103"
#> [162] "10081" "10083" "10084" "10085" "10036" "10086" "10027"
#> [169] "10087" "10025" "10029" "10017" "10049" "10070" "10071"
#> [176] "10033" "10072" "10073" "10104" "10088" "10040" "10124"
#> [183] "10125" "10007" "10126" "10127" "10128"
#>
#> $`1`$test.data.ids
#> [1] "1127915" "1700637" "1918630" "2107638" "2570769" "3011311" "3518345"
#> [8] "5164727" "5971050" "10001" "10018" "10021" "10034" "10057"
#> [15] "10069" "10008" "10092" "10096" "10100" "10075" "10129"
#>
#>
#> $`2`
#> $`2`$feature.ids
#> [1] 380 173 33 262 187 459 304 492 218 31 68 1 541 416 409 126 313 92
#> [19] 248 266 337 503 177 133 72 256 175 95 255 546 365 214 217 476 391 468
#> [37] 488 388 236 219 186 502 460 435 415 512 73 345 107 89 282 59 540 149
#> [55] 305 227 108 212 144 402 466 487 383 43 21 418 362 525 491 411 450 268
#> [73] 370 311 443 407 160 319 496 124 392 340 17 521 61 234 85 527 461 361
#> [91] 303 35 120 129 423 284 516 27 412 191 453 354 356 79 389 301 265 36
#> [109] 30 201 39 131 50 367 320 330 479 481 471 182 252 308 498 286 446 421
#> [127] 490 243 127 287 140 244 542 229 359 196 544 141 66 16 394 422 384 202
#> [145] 397 163 403 458 22 441 385 200 449 152 327 172 143 349 112 52 472 49
#> [163] 332 369 497 80 538 179 455 400 197 45 208 188 401 91 176 505 529 259
#> [181] 433 47 71 364 316 430 533 273 518 183 233 298 128 434 537 257 524 375
#> [199] 368 60 464 390 274 239 522 130 166 51 317 148 105 94 357 18 171 57
#> [217] 275 469 534 297 499 431 161 413 440 414 240 213 194 465 442 331 193 532
#> [235] 508 405 86 195 125 336 67 117 103 277 270 261 432 198 494 104 41 526
#> [253] 486 26 323 228 232 539 427 473 519 48 62 116 292 90 54 504 264 211
#> [271] 395 425 156 444 306 520 509 203 84 65 64 38 5 290 543 467 230 216
#> [289] 238 420 258 263 324 322 168 307 485 478 447 134 463 454 136 312 111 382
#> [307] 283 246 250 34 207 205 184 199 451 121 56 269 482 260 506 378 452 281
#> [325] 23 381 253 19 457 204 100 376 226 343 245 289 6 377 511 96 10 58
#> [343] 386 501 399 192 169 113 150 271 209 417 231 225 335 355 321 445 315 448
#> [361] 123 53 437 404 185 78 350 28 3 7 339 88 495 165 438 147 101 295
#> [379] 325 237 360 206 408 294 249 326 474 118 299 371 99 309 83 462 366 145
#> [397] 329 302 342 341 374 285 69 135 98 109 224 419 74 63 291 180 15 139
#> [415] 110 531 8 146 174 189 106 40 151 348 480 154 470 372 396 220 158 14
#> [433] 483 536 406 247 328 132 410 379 93 426 162 293 159 318 424 37 363 20
#> [451] 42 300 334 278 142 515 507 82 530 157 97 170 55 535 314 288 190 223
#> [469] 221 2 279 242 373 12 393 436 222 9 77 500 11 119 241 210 267 87
#> [487] 114 164 155 215 387 167 489 333 475 428 70 181 344 528 272 296 13 251
#> [505] 76 338 46 235 44 32 122 178 429 137 477 138 456 254 439 4 276 351
#> [523] 347 29 24 346 352 353 280 484 153 81 517 358 310 75 514 510 545 398
#> [541] 115 102 513 25 493 523
#>
#> $`2`$train.data.ids
#> [1] "1000804" "1023964" "1057962" "1099481" "1127915" "1187766" "1208795"
#> [8] "1283494" "1320247" "1359325" "1435954" "1471736" "1497055" "1511464"
#> [15] "1517240" "1700637" "1737393" "1780174" "1854959" "1875084" "1884448"
#> [22] "1918630" "1934623" "1992284" "1995121" "2030383" "2054438" "2107638"
#> [29] "2136051" "2230510" "2260910" "2306976" "2497695" "2570769" "2682736"
#> [36] "2730704" "2735617" "2741068" "2773205" "2821683" "2854839" "2907383"
#> [43] "2950672" "2983819" "2991307" "2996531" "3011311" "3163200" "3174224"
#> [50] "3235580" "3243657" "3349205" "3349423" "3433846" "3441455" "3457975"
#> [57] "3518345" "3542588" "3601861" "3619797" "3650634" "3653737" "3845761"
#> [64] "3999344" "4060823" "4079254" "4084645" "4095229" "4116166" "4154672"
#> [71] "4164316" "4562206" "4827048" "5164727" "5971050" "6206397" "8009688"
#> [78] "8415034" "8692452" "8697774" "8834383" "9326955" "9578663" "9750701"
#> [85] "9907452" "10001" "10016" "10009" "10010" "10012" "10018"
#> [92] "10020" "10021" "10022" "10023" "10028" "10030" "10038"
#> [99] "10034" "10032" "10037" "10039" "10042" "10044" "10050"
#> [106] "10051" "10052" "10053" "10056" "10057" "10054" "10060"
#> [113] "10047" "10062" "10064" "10066" "10068" "10069" "10119"
#> [120] "10109" "10110" "10111" "10112" "10107" "10019" "10077"
#> [127] "10079" "10080" "10008" "10014" "10089" "10090" "10092"
#> [134] "10045" "10094" "10095" "10096" "10097" "10098" "10100"
#> [141] "10101" "10102" "10108" "10002" "10011" "10114" "10115"
#> [148] "10006" "10105" "10106" "10116" "10118" "10120" "10026"
#> [155] "10121" "10122" "10123" "10074" "10075" "10076" "10103"
#> [162] "10081" "10083" "10084" "10085" "10036" "10086" "10027"
#> [169] "10087" "10025" "10029" "10017" "10049" "10070" "10033"
#> [176] "10072" "10073" "10104" "10088" "10040" "10124" "10125"
#> [183] "10007" "10126" "10127" "10128" "10129"
#>
#> $`2`$test.data.ids
#> [1] "1567356" "1740607" "3679455" "4187857" "6568351" "10004" "10031"
#> [8] "10035" "10058" "10059" "10048" "10065" "10078" "10091"
#> [15] "10093" "10099" "10113" "10024" "10117" "10082" "10071"
#>
#>
#> $`3`
#> $`3`$feature.ids
#> [1] 337 201 1 36 518 173 262 62 90 187 340 175 128 129 203 256 68 172
#> [19] 391 217 219 113 92 147 313 107 460 362 255 468 479 466 441 497 383 263
#> [37] 144 365 453 95 408 409 332 301 191 181 432 491 421 268 504 122 60 512
#> [55] 345 481 540 269 314 527 22 244 415 492 376 200 459 440 399 392 120 186
#> [73] 418 45 319 388 41 27 65 357 446 478 252 405 470 498 117 265 94 26
#> [91] 544 532 412 486 33 359 397 350 248 422 542 371 503 246 276 49 464 318
#> [109] 85 402 153 416 521 281 99 227 133 43 482 169 473 303 282 180 72 434
#> [127] 59 451 61 513 230 496 218 355 431 511 378 12 66 420 293 103 253 130
#> [145] 160 395 21 450 136 374 396 516 356 100 284 505 411 208 407 222 385 50
#> [163] 216 266 349 234 382 389 273 37 528 140 179 108 274 233 86 35 321 406
#> [181] 286 533 192 297 423 236 520 425 126 283 16 251 177 449 73 306 384 381
#> [199] 159 232 472 183 195 348 502 112 361 47 463 419 149 125 546 295 367 353
#> [217] 158 213 243 91 176 325 427 15 40 461 475 84 10 57 509 487 435 400
#> [235] 442 338 51 124 67 257 188 538 430 525 329 202 210 360 438 312 64 522
#> [253] 245 310 46 448 78 401 379 83 476 161 123 336 483 316 524 322 5 197
#> [271] 334 155 198 424 287 539 196 54 141 52 537 89 275 82 292 485 205 3
#> [289] 19 278 471 167 199 209 444 114 211 207 364 190 343 237 231 433 39 331
#> [307] 121 366 403 18 280 291 20 8 465 369 17 523 58 304 393 543 285 105
#> [325] 238 212 214 294 484 97 194 333 429 32 488 307 2 98 390 42 118 150
#> [343] 154 320 469 111 193 24 455 308 462 104 347 55 375 258 157 309 101 182
#> [361] 536 277 510 452 354 146 501 109 9 53 131 206 189 166 88 69 134 259
#> [379] 215 74 75 489 457 506 151 339 260 261 127 417 515 110 426 529 351 341
#> [397] 29 352 443 414 404 6 377 323 170 132 494 116 63 507 71 330 242 477
#> [415] 413 264 508 96 168 305 317 31 467 526 541 335 119 447 38 249 184 28
#> [433] 342 535 514 531 79 386 517 81 137 174 445 250 185 224 519 439 380 145
#> [451] 410 225 80 296 372 229 272 226 178 315 13 239 143 270 70 370 235 493
#> [469] 490 48 34 76 87 204 456 4 221 288 165 290 115 289 327 163 228 299
#> [487] 387 436 241 44 267 368 102 326 30 474 23 344 138 254 302 545 77 152
#> [505] 56 428 156 500 164 328 25 271 142 300 14 311 398 139 437 247 240 495
#> [523] 7 394 346 298 93 363 220 454 162 279 106 148 373 358 223 11 324 480
#> [541] 499 458 530 135 171 534
#>
#> $`3`$train.data.ids
#> [1] "1000804" "1023964" "1099481" "1127915" "1187766" "1208795" "1283494"
#> [8] "1320247" "1359325" "1435954" "1497055" "1517240" "1567356" "1700637"
#> [15] "1737393" "1740607" "1780174" "1854959" "1884448" "1918630" "1934623"
#> [22] "1992284" "1995121" "2030383" "2054438" "2107638" "2136051" "2230510"
#> [29] "2260910" "2306976" "2497695" "2570769" "2682736" "2730704" "2735617"
#> [36] "2741068" "2773205" "2821683" "2854839" "2950672" "2983819" "2991307"
#> [43] "3011311" "3163200" "3174224" "3235580" "3349205" "3349423" "3433846"
#> [50] "3441455" "3518345" "3542588" "3601861" "3619797" "3653737" "3679455"
#> [57] "3845761" "3999344" "4060823" "4079254" "4084645" "4095229" "4116166"
#> [64] "4154672" "4164316" "4187857" "4562206" "4827048" "5164727" "5971050"
#> [71] "6206397" "6568351" "8009688" "8415034" "8692452" "8697774" "8834383"
#> [78] "9578663" "9750701" "9907452" "10001" "10004" "10016" "10009"
#> [85] "10010" "10012" "10018" "10020" "10021" "10022" "10023"
#> [92] "10028" "10030" "10038" "10034" "10031" "10032" "10035"
#> [99] "10037" "10039" "10042" "10050" "10051" "10052" "10053"
#> [106] "10056" "10057" "10058" "10059" "10054" "10048" "10060"
#> [113] "10047" "10062" "10064" "10065" "10068" "10069" "10119"
#> [120] "10109" "10111" "10112" "10107" "10019" "10077" "10078"
#> [127] "10079" "10080" "10008" "10014" "10089" "10090" "10091"
#> [134] "10092" "10093" "10095" "10096" "10097" "10098" "10099"
#> [141] "10100" "10101" "10102" "10108" "10002" "10011" "10113"
#> [148] "10114" "10115" "10006" "10024" "10106" "10116" "10117"
#> [155] "10118" "10120" "10026" "10121" "10122" "10082" "10074"
#> [162] "10075" "10076" "10103" "10081" "10083" "10084" "10085"
#> [169] "10036" "10027" "10087" "10017" "10049" "10070" "10071"
#> [176] "10033" "10072" "10073" "10104" "10088" "10040" "10124"
#> [183] "10125" "10126" "10127" "10128" "10129"
#>
#> $`3`$test.data.ids
#> [1] "1057962" "1471736" "1511464" "1875084" "2907383" "2996531" "3243657"
#> [8] "3457975" "3650634" "9326955" "10044" "10066" "10110" "10045"
#> [15] "10094" "10105" "10123" "10086" "10025" "10029" "10007"
#>
#>
#> $`4`
#> $`4`$feature.ids
#> [1] 262 397 33 319 337 173 382 108 90 128 511 130 85 203 136 133 184 383
#> [19] 283 186 512 522 89 479 463 140 58 460 485 411 331 308 218 172 339 219
#> [37] 266 126 368 504 362 105 236 498 255 363 45 207 92 415 66 435 281 304
#> [55] 390 402 180 65 103 360 109 68 422 263 395 428 349 354 468 117 303 284
#> [73] 27 212 41 492 542 497 22 473 269 144 477 259 502 452 95 418 412 370
#> [91] 229 268 187 532 391 265 396 388 174 200 505 405 392 481 113 453 385 176
#> [109] 525 356 274 47 400 491 486 59 437 150 208 253 240 409 201 295 32 194
#> [127] 72 62 305 205 459 177 376 164 399 503 352 403 101 124 282 475 521 367
#> [145] 163 244 257 17 464 149 508 381 100 446 313 162 476 67 34 416 125 28
#> [163] 373 114 1 273 537 344 57 267 60 158 49 77 544 120 355 147 198 87
#> [181] 527 110 196 351 232 407 287 474 35 369 191 40 175 361 36 359 340 297
#> [199] 71 434 286 472 251 107 332 483 441 153 292 94 2 499 197 448 3 8
#> [217] 51 524 104 256 488 432 494 518 378 470 301 450 365 129 86 43 46 298
#> [235] 245 171 21 465 461 447 183 496 289 223 350 9 278 425 193 546 258 195
#> [253] 44 357 293 478 155 540 123 42 520 315 336 327 404 167 83 48 55 462
#> [271] 26 96 442 134 264 211 316 440 353 419 270 189 509 311 54 406 76 216
#> [289] 156 222 141 127 261 489 429 414 466 154 330 170 325 515 545 271 506 445
#> [307] 541 384 148 529 514 487 536 220 214 221 210 217 533 469 79 169 321 131
#> [325] 424 166 151 386 18 366 227 342 234 455 454 444 516 118 237 343 246 16
#> [343] 401 467 543 314 307 277 188 457 160 364 249 209 235 387 329 238 374 393
#> [361] 280 433 398 410 202 111 309 490 230 142 112 52 190 276 146 73 84 24
#> [379] 317 242 31 310 15 335 239 93 115 348 137 318 204 260 371 159 300 165
#> [397] 91 493 389 334 25 417 226 526 517 39 157 241 5 182 458 106 78 495
#> [415] 228 7 275 534 338 528 4 224 38 243 539 456 420 426 372 145 63 152
#> [433] 530 484 225 192 531 14 199 328 178 252 501 380 322 449 61 375 138 97
#> [451] 482 439 296 70 347 30 408 358 19 294 231 50 288 82 500 248 320 69
#> [469] 102 507 13 430 538 306 64 436 513 161 254 98 451 135 119 122 80 438
#> [487] 431 37 75 413 523 326 143 213 168 341 345 139 279 185 233 181 535 179
#> [505] 121 81 74 302 12 56 285 132 88 53 116 312 323 290 272 377 421 206
#> [523] 6 29 480 394 250 471 215 20 23 324 10 346 11 379 519 247 299 291
#> [541] 333 99 427 443 510 423
#>
#> $`4`$train.data.ids
#> [1] "1000804" "1023964" "1057962" "1099481" "1127915" "1187766" "1208795"
#> [8] "1283494" "1320247" "1359325" "1471736" "1497055" "1511464" "1567356"
#> [15] "1700637" "1737393" "1740607" "1780174" "1854959" "1875084" "1884448"
#> [22] "1918630" "1934623" "1992284" "1995121" "2030383" "2054438" "2107638"
#> [29] "2136051" "2230510" "2260910" "2306976" "2497695" "2570769" "2682736"
#> [36] "2730704" "2773205" "2821683" "2854839" "2907383" "2950672" "2983819"
#> [43] "2991307" "2996531" "3011311" "3163200" "3174224" "3243657" "3349205"
#> [50] "3349423" "3433846" "3457975" "3518345" "3542588" "3601861" "3619797"
#> [57] "3650634" "3653737" "3679455" "3845761" "3999344" "4060823" "4079254"
#> [64] "4084645" "4095229" "4116166" "4154672" "4164316" "4187857" "4562206"
#> [71] "5164727" "5971050" "6206397" "6568351" "8009688" "8415034" "8692452"
#> [78] "8697774" "8834383" "9326955" "9750701" "9907452" "10001" "10004"
#> [85] "10016" "10009" "10010" "10012" "10018" "10020" "10021"
#> [92] "10022" "10023" "10028" "10038" "10034" "10031" "10032"
#> [99] "10035" "10037" "10044" "10050" "10051" "10052" "10057"
#> [106] "10058" "10059" "10054" "10048" "10060" "10047" "10064"
#> [113] "10065" "10066" "10068" "10069" "10119" "10109" "10110"
#> [120] "10111" "10112" "10019" "10077" "10078" "10079" "10080"
#> [127] "10008" "10014" "10089" "10090" "10091" "10092" "10045"
#> [134] "10093" "10094" "10095" "10096" "10097" "10098" "10099"
#> [141] "10100" "10101" "10102" "10108" "10011" "10113" "10114"
#> [148] "10115" "10006" "10105" "10024" "10106" "10116" "10117"
#> [155] "10118" "10026" "10121" "10122" "10123" "10082" "10074"
#> [162] "10075" "10076" "10103" "10081" "10083" "10085" "10036"
#> [169] "10086" "10027" "10087" "10025" "10029" "10049" "10070"
#> [176] "10071" "10033" "10073" "10104" "10088" "10040" "10124"
#> [183] "10125" "10007" "10127" "10128" "10129"
#>
#> $`4`$test.data.ids
#> [1] "1435954" "1517240" "2735617" "2741068" "3235580" "3441455" "4827048"
#> [8] "9578663" "10030" "10039" "10042" "10053" "10056" "10062"
#> [15] "10107" "10002" "10120" "10084" "10017" "10072" "10126"
#>
#>
#> $`5`
#> $`5`$feature.ids
#> [1] 173 337 304 262 402 544 175 1 59 63 418 94 129 205 108 502 512 504
#> [19] 35 503 128 177 335 46 412 492 345 287 327 181 527 409 505 542 460 319
#> [37] 140 169 391 450 468 282 395 73 43 388 405 359 68 465 497 479 100 362
#> [55] 191 364 354 321 453 368 49 508 266 66 217 61 360 434 422 473 200 212
#> [73] 95 520 291 491 105 303 172 532 244 109 355 168 47 414 144 313 298 452
#> [91] 27 45 219 378 537 389 70 501 201 448 119 111 256 187 206 147 466 459
#> [109] 361 521 343 392 41 26 415 379 308 440 149 464 207 411 74 127 310 518
#> [127] 180 186 136 397 446 123 463 198 107 451 131 485 114 257 55 232 407 106
#> [145] 163 146 195 478 227 325 231 383 381 264 408 545 390 356 179 259 50 410
#> [163] 33 103 522 311 400 118 89 369 416 332 255 539 60 120 423 403 437 324
#> [181] 192 384 199 349 183 396 78 292 546 274 336 220 376 498 36 523 506 441
#> [199] 30 44 253 252 525 524 263 510 40 270 71 154 288 435 281 13 90 218
#> [217] 461 29 353 417 385 329 101 449 124 83 514 444 318 138 184 267 331 301
#> [235] 320 286 476 32 277 197 543 22 162 57 242 80 64 21 469 526 365 275
#> [253] 110 171 511 261 52 424 159 196 346 406 233 457 386 228 538 97 152 370
#> [271] 529 443 494 425 285 509 91 221 135 419 295 305 367 99 248 439 279 350
#> [289] 6 160 394 438 348 82 69 122 316 189 182 75 273 3 203 528 276 126
#> [307] 234 401 380 399 420 284 481 513 315 475 229 299 216 170 112 404 246 429
#> [325] 358 507 382 145 540 333 306 293 516 247 236 535 474 98 18 208 372 9
#> [343] 330 338 48 211 351 300 312 178 536 137 240 151 241 5 366 230 499 445
#> [361] 117 515 375 188 398 209 25 283 302 77 79 125 202 51 134 260 428 7
#> [379] 150 161 2 14 96 489 85 113 442 20 225 190 65 222 237 226 250 54
#> [397] 426 534 16 317 4 530 344 224 104 339 363 92 323 194 377 352 347 533
#> [415] 238 322 116 155 484 166 167 371 56 373 37 185 271 436 342 387 132 156
#> [433] 53 8 431 393 483 28 72 143 23 251 519 204 289 130 500 235 12 487
#> [451] 265 84 15 413 472 296 427 193 374 76 268 541 309 328 158 269 62 38
#> [469] 517 86 477 447 490 24 462 17 139 121 334 433 272 93 280 245 88 133
#> [487] 239 157 42 214 531 307 314 34 87 432 174 341 141 480 102 148 249 493
#> [505] 254 81 326 294 31 176 210 467 297 456 458 495 488 482 486 455 153 258
#> [523] 340 471 454 278 421 11 165 290 39 470 357 496 223 215 10 213 115 164
#> [541] 67 58 430 19 142 243
#>
#> $`5`$train.data.ids
#> [1] "1023964" "1057962" "1099481" "1127915" "1187766" "1208795" "1283494"
#> [8] "1320247" "1435954" "1471736" "1497055" "1511464" "1517240" "1567356"
#> [15] "1700637" "1737393" "1740607" "1780174" "1854959" "1875084" "1918630"
#> [22] "1934623" "1992284" "1995121" "2030383" "2054438" "2107638" "2136051"
#> [29] "2230510" "2260910" "2306976" "2497695" "2570769" "2682736" "2730704"
#> [36] "2735617" "2741068" "2773205" "2821683" "2854839" "2907383" "2950672"
#> [43] "2983819" "2991307" "2996531" "3011311" "3174224" "3235580" "3243657"
#> [50] "3349205" "3349423" "3433846" "3441455" "3457975" "3518345" "3542588"
#> [57] "3601861" "3619797" "3650634" "3653737" "3679455" "4060823" "4084645"
#> [64] "4095229" "4116166" "4154672" "4164316" "4187857" "4562206" "4827048"
#> [71] "5164727" "5971050" "6206397" "6568351" "8009688" "8415034" "8692452"
#> [78] "8697774" "8834383" "9326955" "9578663" "9750701" "10001" "10004"
#> [85] "10016" "10009" "10010" "10012" "10018" "10020" "10021"
#> [92] "10023" "10028" "10030" "10038" "10034" "10031" "10032"
#> [99] "10035" "10039" "10042" "10044" "10050" "10051" "10053"
#> [106] "10056" "10057" "10058" "10059" "10054" "10048" "10060"
#> [113] "10047" "10062" "10064" "10065" "10066" "10068" "10069"
#> [120] "10119" "10110" "10111" "10112" "10107" "10019" "10077"
#> [127] "10078" "10079" "10080" "10008" "10014" "10089" "10090"
#> [134] "10091" "10092" "10045" "10093" "10094" "10095" "10096"
#> [141] "10097" "10098" "10099" "10100" "10101" "10102" "10002"
#> [148] "10011" "10113" "10114" "10115" "10006" "10105" "10024"
#> [155] "10116" "10117" "10120" "10026" "10121" "10122" "10123"
#> [162] "10082" "10075" "10076" "10081" "10083" "10084" "10085"
#> [169] "10036" "10086" "10087" "10025" "10029" "10017" "10070"
#> [176] "10071" "10072" "10073" "10104" "10088" "10040" "10124"
#> [183] "10125" "10007" "10126" "10127" "10129"
#>
#> $`5`$test.data.ids
#> [1] "1000804" "1359325" "1884448" "3163200" "3845761" "3999344" "4079254"
#> [8] "9907452" "10022" "10037" "10052" "10109" "10108" "10106"
#> [15] "10118" "10074" "10103" "10027" "10049" "10033" "10128"
#>
#>
#> $`6`
#> $`6`$feature.ids
#> [1] 262 337 173 304 33 130 255 521 308 41 22 286 175 157 120 95 365 140
#> [19] 8 319 43 187 218 479 383 1 79 487 243 497 542 228 259 38 352 169
#> [37] 395 434 77 21 265 183 502 466 105 19 504 185 388 524 405 411 59 31
#> [55] 62 16 392 171 144 357 345 73 397 52 313 409 520 498 468 402 396 491
#> [73] 186 460 422 453 544 511 536 391 385 273 27 89 85 172 113 436 440 68
#> [91] 129 532 441 384 446 486 45 342 403 127 505 314 128 473 107 293 518 149
#> [109] 478 227 188 519 459 452 181 256 415 219 496 375 546 203 195 418 253 508
#> [127] 214 201 216 198 199 281 370 354 355 86 378 36 492 261 91 245 356 217
#> [145] 509 117 249 103 420 537 83 112 465 194 231 360 488 271 513 104 47 361
#> [163] 512 463 133 349 442 389 12 213 124 516 343 143 257 278 266 341 449 425
#> [181] 230 285 426 503 450 432 90 335 232 49 34 506 274 527 398 190 39 329
#> [199] 139 412 110 282 367 53 381 364 60 63 407 244 207 481 131 270 435 239
#> [217] 305 353 427 212 191 65 464 303 525 366 141 406 69 289 454 401 347 279
#> [235] 386 202 176 288 145 177 235 495 526 17 309 5 371 135 263 161 538 246
#> [253] 500 334 445 423 295 84 283 108 205 267 179 543 416 61 330 200 539 475
#> [271] 275 534 344 25 501 390 121 456 2 322 136 528 87 67 72 469 101 380
#> [289] 193 269 535 209 368 480 237 419 56 376 116 210 32 291 15 10 522 310
#> [307] 74 44 50 93 9 350 197 206 196 154 290 147 238 132 248 280 221 444
#> [325] 184 76 251 46 467 387 264 215 80 111 240 180 321 404 160 530 11 115
#> [343] 40 94 58 55 541 301 150 408 429 424 170 348 159 92 483 82 312 54
#> [361] 134 99 451 358 81 146 29 316 447 363 472 51 226 306 461 88 443 458
#> [379] 64 223 299 470 336 437 277 125 260 302 455 448 57 254 225 297 158 373
#> [397] 431 340 323 531 109 151 30 417 477 324 482 163 333 97 433 14 123 414
#> [415] 48 529 174 18 327 26 377 211 476 372 540 102 192 494 272 153 252 126
#> [433] 328 317 276 78 346 167 507 413 66 242 318 287 42 156 490 462 138 70
#> [451] 428 400 122 165 307 35 258 332 485 100 374 182 489 474 189 421 7 284
#> [469] 331 471 118 114 320 362 294 457 166 24 292 152 514 300 164 296 545 119
#> [487] 298 222 325 484 499 37 142 155 379 359 178 523 162 3 106 71 399 234
#> [505] 339 393 351 6 439 515 250 315 229 98 247 236 533 233 204 394 224 168
#> [523] 369 311 20 137 208 241 23 382 510 220 4 75 493 28 326 438 410 268
#> [541] 517 338 430 148 96 13
#>
#> $`6`$train.data.ids
#> [1] "1000804" "1057962" "1099481" "1127915" "1208795" "1359325" "1435954"
#> [8] "1471736" "1511464" "1517240" "1567356" "1700637" "1737393" "1740607"
#> [15] "1780174" "1875084" "1884448" "1918630" "1992284" "1995121" "2030383"
#> [22] "2054438" "2107638" "2136051" "2230510" "2260910" "2306976" "2497695"
#> [29] "2570769" "2682736" "2735617" "2741068" "2773205" "2821683" "2907383"
#> [36] "2950672" "2983819" "2991307" "2996531" "3011311" "3163200" "3174224"
#> [43] "3235580" "3243657" "3349205" "3349423" "3433846" "3441455" "3457975"
#> [50] "3518345" "3542588" "3650634" "3653737" "3679455" "3845761" "3999344"
#> [57] "4060823" "4079254" "4084645" "4095229" "4116166" "4154672" "4164316"
#> [64] "4187857" "4562206" "4827048" "5164727" "5971050" "6206397" "6568351"
#> [71] "8009688" "8692452" "8697774" "9326955" "9578663" "9750701" "9907452"
#> [78] "10001" "10004" "10009" "10012" "10018" "10020" "10021"
#> [85] "10022" "10023" "10028" "10030" "10038" "10034" "10031"
#> [92] "10032" "10035" "10037" "10039" "10042" "10044" "10050"
#> [99] "10051" "10052" "10053" "10056" "10057" "10058" "10059"
#> [106] "10054" "10048" "10047" "10062" "10064" "10065" "10066"
#> [113] "10068" "10069" "10109" "10110" "10111" "10112" "10107"
#> [120] "10019" "10077" "10078" "10079" "10080" "10008" "10014"
#> [127] "10089" "10090" "10091" "10092" "10045" "10093" "10094"
#> [134] "10095" "10096" "10097" "10098" "10099" "10100" "10102"
#> [141] "10108" "10002" "10011" "10113" "10114" "10006" "10105"
#> [148] "10024" "10106" "10116" "10117" "10118" "10120" "10026"
#> [155] "10121" "10122" "10123" "10082" "10074" "10075" "10076"
#> [162] "10103" "10081" "10083" "10084" "10085" "10036" "10086"
#> [169] "10027" "10087" "10025" "10029" "10017" "10049" "10070"
#> [176] "10071" "10033" "10072" "10088" "10040" "10124" "10125"
#> [183] "10007" "10126" "10127" "10128" "10129"
#>
#> $`6`$test.data.ids
#> [1] "1023964" "1187766" "1283494" "1320247" "1497055" "1854959" "1934623"
#> [8] "2730704" "2854839" "3601861" "3619797" "8415034" "8834383" "10016"
#> [15] "10010" "10060" "10119" "10101" "10115" "10073" "10104"
Each list element in `results` contains the feature rankings for that fold (`feature.ids`), as well as the training set row names used to obtain them (`train.data.ids`). The remaining test set row names are included as well (`test.data.ids`).
If we were eventually going to apply these findings to a final, independent test set, we would still want a single consensus ranking of the best features across all of this training data.
top.features <- WriteFeatures(results, input, save=FALSE)
plot(top.features$AvgRank)
Ordered by average rank across the 10 folds (`AvgRank`; lower numbers are better), this gives us a list of the feature names (`FeatureName`, i.e. the corresponding column name from `input`), as well as the feature indices (`FeatureID`, i.e. the corresponding column index from `input` minus 1, to account for the class label column).
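For example, the names of the top 10 consensus features can be pulled straight from this data frame:

# Top 10 feature names by average rank across the 10 folds
head(top.features$FeatureName, 10)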
Now that we have a ranking of features for each of the 10 training sets, the final step is to estimate the generalization error we could expect if we trained a final classifier on these features and applied it to a new test set. Here, a radial basis function (RBF) kernel SVM is tuned on each training set independently. This consists of doing internal 10-fold CV error estimation at each combination of the SVM hyperparameters cost and gamma, using grid search. The optimal parameters are then used to train the SVM on the entire training set. Finally, generalization error is determined by predicting the corresponding test set. This is done for each fold in the external 10-fold CV, and all 10 of these generalization error estimates are averaged for stability. The whole process is repeated while varying the number of top features used as input; there will typically be a “sweet spot” with neither too many nor too few features. Outlined, this process looks like:
external 10x CV
    Rank features with mSVM-RFE
    for(nfeat=1 to 500ish)
        Grid search over SVM parameters
            10x CV
                Train SVM
                Obtain generalization error estimate
            Average generalization error estimates across multiple folds
        Choose parameters with best average performance
        Train SVM on full training set
        Obtain generalization error estimate on corresponding external CV test set
Average generalization errors across multiple folds
Choose the optimum number of features
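The grid search step relies on `tune.svm` from e1071. Here is a standalone sketch of that step for a single external fold (the grid of gamma and cost values is an assumption, chosen to match the powers of two visible in the output below; `FeatSweep.wrap` automates all of this per fold and per feature count):

# Hedged sketch: tune an RBF SVM on the training data for external fold 1,
# using only the top 5 features from that fold's ranking
train.rows <- unlist(folds[-1])                  # all observations outside fold 1
feats      <- results[[1]]$feature.ids[1:5] + 1  # +1 offsets the class label column
tuned <- tune.svm(x     = input[train.rows, feats],
                  y     = input[train.rows, 1],
                  gamma = 2^(-12:0), cost = 2^(-6:6),
                  tunecontrol = tune.control(cross = 10))
tuned$best.parameters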
To implement the full sweep over the top 5 features, we do:
featsweep <- lapply(1:5, FeatSweep.wrap, results, input)
featsweep
#> [[1]]
#> [[1]]$svm.list
#> [[1]]$svm.list$`1`
#> gamma cost error dispersion
#> 1 0.03125 64 0.4761905 NA
#>
#> [[1]]$svm.list$`2`
#> gamma cost error dispersion
#> 1 0.0002441406 0.015625 0.6190476 NA
#>
#> [[1]]$svm.list$`3`
#> gamma cost error dispersion
#> 1 1 4 0.4761905 NA
#>
#> [[1]]$svm.list$`4`
#> gamma cost error dispersion
#> 1 1 1 0.3333333 NA
#>
#> [[1]]$svm.list$`5`
#> gamma cost error dispersion
#> 1 0.5 0.5 0.3809524 NA
#>
#> [[1]]$svm.list$`6`
#> gamma cost error dispersion
#> 1 0.001953125 64 0.5238095 NA
#>
#> [[1]]$svm.list$`7`
#> gamma cost error dispersion
#> 1 0.0078125 32 0.4285714 NA
#>
#> [[1]]$svm.list$`8`
#> gamma cost error dispersion
#> 1 0.25 0.125 0.4761905 NA
#>
#> [[1]]$svm.list$`9`
#> gamma cost error dispersion
#> 1 1 16 0.6 NA
#>
#> [[1]]$svm.list$`10`
#> gamma cost error dispersion
#> 1 0.5 0.5 0.45 NA
#>
#>
#> [[1]]$error
#> [1] 0.4764286
#>
#>
#> [[2]]
#> [[2]]$svm.list
#> [[2]]$svm.list$`1`
#> gamma cost error dispersion
#> 1 0.015625 64 0.2857143 NA
#>
#> [[2]]$svm.list$`2`
#> gamma cost error dispersion
#> 1 0.25 8 0.6190476 NA
#>
#> [[2]]$svm.list$`3`
#> gamma cost error dispersion
#> 1 0.015625 8 0.4285714 NA
#>
#> [[2]]$svm.list$`4`
#> gamma cost error dispersion
#> 1 0.5 32 0.3333333 NA
#>
#> [[2]]$svm.list$`5`
#> gamma cost error dispersion
#> 1 0.03125 16 0.3809524 NA
#>
#> [[2]]$svm.list$`6`
#> gamma cost error dispersion
#> 1 0.0625 1 0.4761905 NA
#>
#> [[2]]$svm.list$`7`
#> gamma cost error dispersion
#> 1 0.25 32 0.3809524 NA
#>
#> [[2]]$svm.list$`8`
#> gamma cost error dispersion
#> 1 0.25 1 0.4285714 NA
#>
#> [[2]]$svm.list$`9`
#> gamma cost error dispersion
#> 1 1 32 0.5 NA
#>
#> [[2]]$svm.list$`10`
#> gamma cost error dispersion
#> 1 0.0078125 16 0.45 NA
#>
#>
#> [[2]]$error
#> [1] 0.4283333
#>
#>
#> [[3]]
#> [[3]]$svm.list
#> [[3]]$svm.list$`1`
#> gamma cost error dispersion
#> 1 0.125 32 0.2857143 NA
#>
#> [[3]]$svm.list$`2`
#> gamma cost error dispersion
#> 1 0.0625 16 0.5238095 NA
#>
#> [[3]]$svm.list$`3`
#> gamma cost error dispersion
#> 1 0.25 0.25 0.3333333 NA
#>
#> [[3]]$svm.list$`4`
#> gamma cost error dispersion
#> 1 0.5 1 0.2857143 NA
#>
#> [[3]]$svm.list$`5`
#> gamma cost error dispersion
#> 1 0.0625 16 0.4761905 NA
#>
#> [[3]]$svm.list$`6`
#> gamma cost error dispersion
#> 1 0.015625 4 0.4285714 NA
#>
#> [[3]]$svm.list$`7`
#> gamma cost error dispersion
#> 1 0.25 16 0.4285714 NA
#>
#> [[3]]$svm.list$`8`
#> gamma cost error dispersion
#> 1 0.5 32 0.3809524 NA
#>
#> [[3]]$svm.list$`9`
#> gamma cost error dispersion
#> 1 0.0625 32 0.3 NA
#>
#> [[3]]$svm.list$`10`
#> gamma cost error dispersion
#> 1 0.25 16 0.5 NA
#>
#>
#> [[3]]$error
#> [1] 0.3942857
#>
#>
#> [[4]]
#> [[4]]$svm.list
#> [[4]]$svm.list$`1`
#> gamma cost error dispersion
#> 1 0.125 2 0.3809524 NA
#>
#> [[4]]$svm.list$`2`
#> gamma cost error dispersion
#> 1 0.125 32 0.4761905 NA
#>
#> [[4]]$svm.list$`3`
#> gamma cost error dispersion
#> 1 0.015625 64 0.4285714 NA
#>
#> [[4]]$svm.list$`4`
#> gamma cost error dispersion
#> 1 0.125 16 0.2857143 NA
#>
#> [[4]]$svm.list$`5`
#> gamma cost error dispersion
#> 1 0.03125 8 0.3333333 NA
#>
#> [[4]]$svm.list$`6`
#> gamma cost error dispersion
#> 1 0.0625 1 0.3809524 NA
#>
#> [[4]]$svm.list$`7`
#> gamma cost error dispersion
#> 1 0.03125 16 0.4285714 NA
#>
#> [[4]]$svm.list$`8`
#> gamma cost error dispersion
#> 1 0.125 8 0.3809524 NA
#>
#> [[4]]$svm.list$`9`
#> gamma cost error dispersion
#> 1 0.125 2 0.3 NA
#>
#> [[4]]$svm.list$`10`
#> gamma cost error dispersion
#> 1 0.125 32 0.4 NA
#>
#>
#> [[4]]$error
#> [1] 0.3795238
#>
#>
#> [[5]]
#> [[5]]$svm.list
#> [[5]]$svm.list$`1`
#> gamma cost error dispersion
#> 1 0.0625 16 0.2380952 NA
#>
#> [[5]]$svm.list$`2`
#> gamma cost error dispersion
#> 1 0.0078125 16 0.4285714 NA
#>
#> [[5]]$svm.list$`3`
#> gamma cost error dispersion
#> 1 0.0625 8 0.4761905 NA
#>
#> [[5]]$svm.list$`4`
#> gamma cost error dispersion
#> 1 0.03125 4 0.2380952 NA
#>
#> [[5]]$svm.list$`5`
#> gamma cost error dispersion
#> 1 0.0625 1 0.2857143 NA
#>
#> [[5]]$svm.list$`6`
#> gamma cost error dispersion
#> 1 0.25 0.5 0.4285714 NA
#>
#> [[5]]$svm.list$`7`
#> gamma cost error dispersion
#> 1 0.03125 1 0.3333333 NA
#>
#> [[5]]$svm.list$`8`
#> gamma cost error dispersion
#> 1 0.00390625 32 0.4285714 NA
#>
#> [[5]]$svm.list$`9`
#> gamma cost error dispersion
#> 1 0.0078125 64 0.35 NA
#>
#> [[5]]$svm.list$`10`
#> gamma cost error dispersion
#> 1 0.25 2 0.55 NA
#>
#>
#> [[5]]$error
#> [1] 0.3757143
Each `featsweep` list element corresponds to using that many of the top features (i.e. `featsweep[1]` uses only the top feature, `featsweep[2]` uses the top 2 features, etc.). Within each, `svm.list` contains the generalization error estimates for each of the 10 folds in the external 10-fold CV. These estimates are averaged as `error`.
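For example, the averaged estimate for the single-feature model can be pulled out directly:

featsweep[[1]]$error
#> [1] 0.4764286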
To show these results visually, we can plot the average generalization error against the number of top features used as input.
For reference, it is also useful to show the chance error rate. Typically, this equals the “no information” rate we would get if we simply always picked the class label with the greater prevalence in the data set.
no.info <- min(prop.table(table(input[,1])))
errors <- sapply(featsweep, function(x) ifelse(is.null(x), NA, x$error))
PlotErrors(errors, no.info=no.info)
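To read off the “sweet spot” programmatically, we can find the number of features with the lowest average error:

# Number of top features giving the lowest average generalization error
which.min(errors)
#> [1] 5
min(errors, na.rm = TRUE)
#> [1] 0.3757143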
As you can probably see, the main limitation of this type of exploration is processing time. For example, in this demonstration, just counting the number of times we have to fit an SVM, we have:
Feature ranking
    10 external folds x 546 features x 10 msvmRFE folds = 54,600 linear SVMs
Generalization error estimation
    10 external folds x 546 features x 169 hyperparameter combos for exhaustive grid search x 10 folds each = 9,227,400 RBF kernel SVMs
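As a quick check on that arithmetic (169 presumably being the 13 x 13 grid of gamma and cost values):

10 * 546 * 10        # linear SVMs fit during feature ranking
#> [1] 54600
10 * 546 * 169 * 10  # RBF SVMs for a full sweep over all 546 features
#> [1] 9227400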
We have already shortened this somewhat by 1) eliminating more than one feature at a time in the feature ranking step, and 2) only estimating generalization error across a sweep of the top features.
Other ideas to shorten processing time:
This code is already set up to use `lapply` calls for these two main tasks, so fortunately, they can be relatively easily parallelized using a variety of R packages that provide parallel versions of `lapply`. Examples include:
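One option from base R is the parallel package; a minimal sketch (`mclapply` forks across cores on Unix-alikes; the core count here is just for illustration):

# Drop-in parallel replacement for the ranking step
library(parallel)
results <- mclapply(folds, svmRFE.wrap, input, k = 10, halve.above = 100,
                    mc.cores = 4)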