Marmanis H., Babenko D.: Algorithms of the intelligent web
Clustering by data type 129—130
Clustering distribution, optimality 315
Clustering in high dimensions 157
Clustering, agglomerative hierarchical algorithms 129
Clustering, algorithm 302
Clustering, arbitrary objects 128
Clustering, array sorting 125
Clustering, average distance 137
Clustering, average-link algorithm 137
Clustering, BIRCH algorithm 131
Clustering, book example 122
Clustering, categorical data 130
Clustering, categorization 128
Clustering, centroid 142
Clustering, computational complexity 157
Clustering, conceptual modeling 129
Clustering, constrained algorithms 130
Clustering, curse of dimensionality 159
Clustering, data normalization 127
Clustering, data squashing 158
Clustering, DBSCAN 151
Clustering, dendrograms 132
Clustering, density-based algorithms 151
Clustering, divisive hierarchical algorithms 129
Clustering, epsilon neighborhood 154
Clustering, Euclidean distance 127
Clustering, fine tuning 316
Clustering, goodness measure 150
Clustering, hierarchical 316
Clustering, hierarchical algorithms 129
Clustering, high dimensionality 158
Clustering, human expert 127
Clustering, iterative optimization 129
Clustering, k-means algorithm 129 142
Clustering, lack of normalization 134
Clustering, large databases 131
Clustering, link-based algorithms 134
Clustering, many dimensions 128
Clustering, mean value 142
Clustering, MST 139
Clustering, news articles 129
Clustering, objective 150
Clustering, overview 128
Clustering, partitional algorithms 129
Clustering, performance, characteristics 157
Clustering, point density 151
Clustering, proximity threshold 135
Clustering, R-trees 158
Clustering, ROCK 147
Clustering, single-link algorithm 135
Clustering, singletons 140
Clustering, Sourceforge-like case study 123
Clustering, spectral methods 130
Clustering, SQL limitations 125
Clustering, SQLEM algorithm 125
Clustering, threshold parameter 129
Clustering, very large datasets 157
Clustering, visual identification 124
Clustering, VLDB 131
Clustering, wavelet methods 130
Clustering, with SQL 124
Cochran’s Q test 233 250 255—256
Codehaus XFire see “Apache CXF”
Collaboration, as opposed to intelligence 4
Collaborative filtering see “CF”
Collaborative platforms 5
Collective intelligence 4
Collective knowledge, capture 4
Combination of classifiers 232
Combination of classifiers, computational robustness 233
Combination of classifiers, representational advantage 233
Combination of classifiers, risk reduction 233
Combining classifiers, bagging 234
Combining classifiers, boosting 234
Comparator 126
Complexity, multiclass classification 224
Computational cluster 63
Computational complexity 157
Computational cost 316
Computational linguistics 327
Computational nodes 63
Concept 175 182
ConceptMajorityVoter 264
conceptPriors 50
conceptPriors, map 183
conditional probabilities 50 183
Conditional probabilities, user clicks 48
Confidence interval 221
conflict resolution 196—198
Confusion matrix 220 243 259 274
Constrained clustering algorithms 130
Content 13
Content aggregation, digg.com 99
Content aggregator 6 9
Content annotation 9
Content cleansing 283
Content field 27 31
Content impurities 283
Content reconciliation 7
Content similarity, case study 93
Content similarity, normalization 103
Content similarity, text analysis sensitivity 96
Content-based, accumulation and analysis 80
Content-based, recommendation 70
Corcho, Oscar 165
Correlation, complete negative 111
Correlation, complete positive 111
Cosine similarity 95 324
CosineDistance 152
CosineSimilarity 149 187
CosineSimilarityMeasure 95
Cost, function 223 230
Cost, matrix 230
craigslist 13
Crawler 13
Crawler, collecting data 22
Crawler, custom 281
Crawler, fetched documents 24
Crawler, known URLs 24
Crawler, page links 24
Crawler, processed documents 24
Crawling 23 30 281—282
Crawling, Apache Tika 321
Crawling, custom web crawler 320
Crawling, depth of 13
Crawling, Heritrix 321
Crawling, Nutch 321
Crawling, retrieved content structure 282
Crawling, stages of 320
CrawlResultsNewsDataset 284
createClusters 300—301
createClustersWithinTopics 300 305
Credibility of classification 219
Credit card activity 236
Credit risk 233
Credit score 236
Credit worthiness, attributes 235
Credit worthiness, case study 233
Credit worthiness, overview 234
CreditErrorEstimator 244 266 274
Criminal record 236
Cross product calculation 111
Cross-referencing 304
Curse of dimensionality 159 166
CustomAnalyzer 95
Cutting, Doug 22
DAG 172 202
Damping factor 36
DangerousUserType 239
Dangling node 62
Dangling node, heuristic 67
Data diversity 265
Data, incongruent 17
Data, missing values 17
Data, noisy 259
Data normalization 17 156 204
Data normalization, PearsonCorrelation 113
Data preprocessing 204
Data reliability 17
Data renormalization 115
Data representation, inaccuracies 17
Data size issues 18
Data squashing 158
Data understanding 207
Data understanding, importance 279
Data variability 17
DataGenerator 240
Datapoint 146 154
Dataset, dimensionality 156
DataSetAdapter 311
DBSCAN, algorithm 162
DBSCAN, border point 154
DBSCAN, core point 154
DBSCAN, directly density reachable 154
DBSCAN, eps variable 154
DBSCAN, ink drops analogy 151
DBSCAN, minPoints variable 154
DBSCANAlgorithm 152—154
Decision tree 170 245 258 273
Decision tree classification, instability 247
Decision tree classification, interpretation 247
Decision tree, accuracy 246
Decision tree, algorithms 171
Decision tree, classifier 234 266—267
decisionTree, printTree 247
Declarative programming 188
Default analyzer 31
Degree of belief 81
Degree of credibility 223
Degree of freedom 251 255—256
DELPHI 310—311
Delphi, Dataset interface 81
Delphi, inner workings 86
Delphi, recommend 87
Delphi, recommendation engine 80—81
Delphi, similarity between users 82
DelphiUC 103
DelphiUR 103
Dendrogram 132
Dendrogram, data structure 132—133
Dendrogram, initialized 138
Dendrogram, two linked hash maps 132
Dendrogram, visual representation of 132
Density-based, algorithms 151
Density-based, spatial clustering of applications with noise see “DBSCAN”
dez 165
Dhillon, Inderjit S. 145
Diagnosis of diseases 166
Diagnosis of injuries 166
Diff2PropTest 253
Difference of proportions test 233 250 253
Digg stories, blood donors 146
Digg stories, CSV file 146
Digg stories, Facebook 146
Digg, API 99 146
Digg, RESTful services 14
DiggCategory 100
DiggDelphi, findSimilarUsers 103
DiggDelphi, getTopNFriends 103
DiggDelphi, inner workings 102
DiggDelphi, recommend 103—104
Dimensionality, curse of 157
Directed acyclic graphs see “DAG”
Directed graph 34
Discourse 288 328
Distance 154
Distance, city block 324
Distance, Euclidean 324
Distance, L2 324
Distance, properties 73
Distance, symmetry 74
Distance, taxi cab 324
Distance, triangle inequality 74
Distributed computing, fallacies 17—19
Distribution of clusters 305
Divisive hierarchical algorithms 129
Docid field 27
DocRank 55—56 280 286
DocRank, inner workings 57
DocRank, matrix builder 57
DocRank, relational tables 61
DocRank, values reused 61
Doctype field 27
Document collection, business news 23
Document collection, Lance Armstrong 23
Document collection, U.S. politics 23
Document collection, world news 23
Document, distance 92
Document, heuristic importance 59
Document, terms 286 288
Domain of discourse 5
Dot (inner) product 96
Drools 165 170 189 193
Drools attribute, no-loop 197
Drools attribute, ruleflow-group 197
Drools attribute, salience 197
Drools, ReteOO 190
DTCreditClassifier 246 259 266
Dunham, Margaret 158
Eclipse 189
Eisner, Jason 161
Elements of intelligence, synergy 100
EM algorithm, E-step 161
EM algorithm, M-step 161
Email categorization 174 178 187
Email classification, blacklists 175
Email classification, header tests 175
Email classification, idiosyncrasies 175
Email classification, real-time blackhole lists 175
Email classification, whitelists 175
Email concept, NOT SPAM 178
Email concept, SPAM 178
Email content, congressional elections 175
Email content, global warming 175
Email content, Lance Armstrong 175
Email content, marathon 175
Email content, newspaper advertisement 175
Email content, Nicaragua elections 175
Email content, NVidia stock 175
Email content, Ortega 175
Email content, spam 175
Email content, U.S. politics 175
Email content, world news 175
Email messages, sorting 174
EmailClassifier 175—176 178 184
EmailData 176
EmailDataset 176
EmailDataset, getTrainingSet 178
EmailDataset, setBinary (false) 187
EmailInstance 178
EmailRuleClassifier 192
Embedding intelligence 11
Engage 9
Ensembles of classifiers 263
Ensembles, accuracy 260
Epictetus 232
Epsilon neighborhood 154
