Building Smart Web 2.0 Applications
First Edition September 2007
ISBN 978-0-596-52932-1
Index
A
advancedclassify.py
dotproduct function, 203
dpclassify function, 205
getlocation function, 207, 208
getoffset function, 213
lineartrain function, 202
loadnumerical function, 209
matchcount function, 206
matchrow class
loadmatch function, 198
milesdistance function, 207, 208
nonlinearclassify function, 213
rbf function, 213
scaledata function, 210
scaleinput function, 210
yesno function, 206
agesonly.csv file, 198
Akismet, xvii, 138
akismettest.py, 138
algorithms, 4
CART (see CART)
collaborative filtering, 8
feature-extraction, 228
genetic (see genetic algorithms)
hierarchical clustering, 35
Item-based Collaborative Filtering Recommendation Algorithms, 27
mass-and-spring, 111
matrix math, 237
other uses for learning, 5
PageRank (see PageRank algorithm)
stemming, 61
summary, 277-306
Bayesian classifier, 277-281
Amazon, 5, 53
recommendation engines, 7
annealing
defined, 95
simulated, 95-96
articlewords dictionary, 231
artificial intelligence (AI), 3
artificial neural network (see neural network, artificial)
Atom feeds
counting words in, 31-33
parsing, 309
Audioscrobbler, 28
B
backpropagation, 80-82, 287
Bayes' Theorem, 125
Bayesian classification, 231
Bayesian classifier, 140, 277-281
classifying, 279
combinations of features, 280
naïve, 279
strengths and weaknesses, 280
support-vector machines (SVMs), 225
training, 278
Beautiful Soup, 45, 310
crawler, 57
installation, 311
usage example, 311
bell curve, 174
best-fit line, 12
biotechnology, 5
black box method, 288
blogs
clustering based on word frequencies, 30
feeds
counting words, 31-33
filtering, 134-136
(see also Atom feeds; RSS feeds)
Boolean operations, 84
breeding, 97, 251, 263
C
CART (Classification and Regression Trees), 145-146
categorical features
determining distances using Yahoo! Maps, 207
lists of interests, 206
yes/no questions, 206
centroids, 298
chi-squared distribution, 130
classifiers
basic linear, 202-205
Bayesian (see Bayesian classifier)
decision tree, 199-201
decision tree (see decision tree classifier)
naïve Bayesian (see naïve Bayesian classifier)
neural network, 141
persisting trained, 132-133
SQLite, 132-133
supervised, 226
training, 119-121
classifying
Bayesian classifier, 279
documents, 118-119
training classifiers, 119-121
click-training network, 74
closing price, 243
clustering, 29, 226, 232
column, 40-42
common uses, 29
hierarchical (see hierarchical clustering)
K-means, 248
K-means clustering (see K-means clustering)
word vectors (see word vectors)
clusters of preferences, 44-47
Beautiful Soup, 45
clustering results, 47
defining distance metric, 47
getting and preparing data, 45
scraping Zebo results, 45
Zebo, 44
clusters.py, 38
bicluster class, 35
draw2d function, 51
drawdendrogram function, 39
drawnode function, 39
getheight function, 38
hcluster function, 36
printclust function, 37
readfile function, 34
rotatematrix function, 40
scaledown function, 50
cocktail party problem, 226
collaborative filtering, 7
algorithm, 8
term first used, 8
collective intelligence
defined, 2
introduction, 1-6
column clustering, 40-42
conditional probability, 122, 319
Bayes' Theorem, 125
content-based ranking, 64-69
document location, 65
normalization, 66
word distance, 65, 68
word frequency, 64, 66
converting longitudes and latitudes of two points into distance in miles, 208
cost function, 89-91, 109, 304
global minimum, 305
local minima, 305
crawler, 56-58
Beautiful Soup API, 57
code, 57-58
urllib2, 56
crawling, 54
crossover, 97, 251, 263
cross-validation, 176-178, 294
leave-one-out, 196
squaring numbers, 177
test sets, 176
training sets, 176
cross-validation function, 219
cumulative probability, 185
D
data clustering (see clustering)
data matrix, 238
data, viewing in two dimensions, 49-52
dating sites, 5
decision boundary, 201
decision tree classifier, 199, 281-284
interactions of variables, and, 284
strengths and weaknesses, 284
training, 281
decision tree modeling, 321
decision trees, 142-166
best split, 147-148
CART algorithm, 145-146
classifying new observations, 153-154
disadvantages of, 165
displaying, 151-153
graphical, 152-153
early stopping, 165
entropy, 148
exercises, 165
Gini impurity, 147
introducing, 144-145
missing data, 156-158, 166
missing data ranges, 165
modeling home prices, 158-161
Zillow API, 159-161
modeling hotness, 161-164
multiway splits, 166
numerical outcomes, 158
predicting signups, 142-144
pruning, 154-156
real world, 155
recursive tree binding, 149-151
result probabilities, 165
training, 145-146
when to use, 164-165
del.icio.us, xvii, 314
building link recommender, 19-22
building dataset, 20
del.icio.us API, 20
recommending neighbors and links, 22
deliciousrec.py
fillItems function, 21
initializeUserDict function, 20
dendrogram, 34
drawing, 38-40
drawnode function, 39
determining distances using Yahoo! Maps, 207
distance metric
defining, 47
distance metrics, 29
distributions, uneven, 183-188
diversity, 268
docclass.py
classifier class
catcount method, 133
categories method, 133
fcount method, 132
incc method, 133
incf method, 132
setdb method, 132
totalcount method, 133
classifier class, 119, 136
classify method, 127
fisherclassifier method, 128
fprob method, 121
train method, 121
weightedprob method, 123
fisherclassifier class
classify method, 131
fisherprob method, 129
setminimum method, 131
getwords function, 118
naivebayes class, 124
prob method, 125
sampletrain function, 121
document filtering, 117-141
Akismet, 138
arbitrary phrase length, 140
blog feeds, 134-136
calculating probabilities, 121-123
assumed probability, 122
conditional probability, 122
classifying documents, 118-119
training classifiers, 119-121
exercises, 140
Fisher method, 127-131
classifying items, 130
combining probabilities, 129
versus naïve Bayesian filter, 127
improving feature detection, 136-138
naïve Bayesian classifier, 123-127
choosing category, 126
naïve Bayesian filter
versus Fisher method, 127
neural network classifier, 141
persisting trained classifiers, 132-133
SQLite, 132-133
Pr(Document), 140
spam, 117
varying assumed probabilities, 140
virtual features, 141
document location, 65
content-based ranking
document location, 67
dorm.py, 106
dormcost function, 109
printsolution function, 108
dot-product, 322
code, 322
dot-products, 203, 290
downloadzebodata.py, 45, 46
E
eBay, xvii
eBay API, 189-195, 196
developer key, 189
getting details for item, 193
performing search, 191
price predictor, building, 194
Quick Start Guide, 189
setting up connection, 190
ebaypredict.py
doSearch function, 191
getCategory function, 192
getHeaders function, 190
getItem function, 193
getSingleValue function, 190
makeLaptopDataset function, 194
sendrequest function, 190, 191
elitism, 266
entropy, 148, 320
code, 320
Euclidean distance, 203, 316
code, 316
k-nearest neighbors (kNN), 293
score, 10-11
exact matches, 84
F
Facebook, 110
building match dataset, 223
creating session, 220
developer key, 219
downloading friend data, 222
matching on, 219-224
other Facebook predictions, 225
facebook.py
arefriends function, 223
createtoken function, 221
fbsession class, 220
getfriends function, 222
getinfo method, 222
getlogin function, 221
getsession function, 221
makedataset function, 223
makehash function, 221
sendrequest method, 220
factorize function, 238
feature extraction, 226-248
news, 227-230
feature-extraction algorithm, 228
features, 277
features matrix, 234
feedfilter.py, 134
entryfeatures method, 137
feedforward algorithm, 78-80
feedparser, 229
filtering
documents (see document filtering)
rule-based, 118
spam
threshold, 126
tips, 126
financial fraud detection, 6
financial markets, 2
Fisher method, 127-131
classifying items, 130
combining probabilities, 129
versus naïve Bayesian filter, 127
fitness function, 251
flight data, 116
flight searches, 101-106
full-text search engines (see search engines)
futures markets, 2
G
Gaussian function, 174, 321
code, 321
Gaussian-weighted sum, 188
generatefeedvector.py, 31, 32
getwords function, 31
generation, 97
genetic algorithms, 97-100, 306
crossover or breeding, 97
generation, 97
mutation, 97
population, 97
versus genetic programming, 251
genetic optimization stopping criteria, 116
genetic programming, 99, 250-276
breeding, 251
building environment, 265-268
creating initial population, 257
crossover, 251
data types, 274
dictionaries, 274
lists, 274
objects, 274
strings, 274
diversity, 268
elitism, 266
exercises, 276
fitness function, 251
function types, 276
further possibilities, 273-275
hidden functions, 276
measuring success, 260
memory, 274
mutating programs, 260-263
mutation, 251
nodes with datatypes, 276
numerical functions, 273
overview, 250
parse tree, 253
playing against real people, 272
programs as trees, 253-257
Python and, 253-257
random crossover, 276
replacement mutation, 276
RoboCup, 252
round-robin tournament, 270
simple games, 268-273
Grid War, 268
playing against real people, 272
round-robin tournament, 270
stopping evolution, 276
successes, 252
testing solution, 259
tic-tac-toe simulator, 276
versus genetic algorithms, 251
Geocoding, 207
API, 207
Gini impurity, 147, 319
code, 320
global minimum, 94, 305
Goldberg, David, 8
Google, 1, 3, 5
PageRank algorithm (see PageRank algorithm)
Google Blog Search, 134
gp.py, 254-258
buildhiddenset function, 259
constnode class, 254, 255
crossover function, 263
evolve function, 265, 268
fwrapper class, 254, 255
getrankfunction function, 267
gridgame function, 269
hiddenfunction function, 259
humanplayer function, 272
mutate function, 261
node class, 254, 255
display method, 256
exampletree function, 255
makerandomtree function, 257
paramnode class, 254, 255
rankfunction function
breedingrate, 266
mutationrate, 266
popsize, 266
probexp, 266
probnew, 266
scorefunction function, 260
tournament function, 271
grade inflation, 12
Grid War, 268
player, 276
group travel cost function, 116
group travel planning, 87-88
car rental period, 89
cost function (see cost function)
departure time, 89
price, 89
time, 89
waiting time, 89
GroupLens, 25
web site, 27
groups, discovering, 29-53
blog clustering, 53
clusters of preferences (see clusters of preferences)
column clustering (see column clustering)
data clustering (see data clustering)
exercises, 53
hierarchical clustering (see hierarchical clustering)
K-means clustering (see K-means clustering)
Manhattan distance, 53
multidimensional scaling (see multidimensional scaling)
supervised versus unsupervised learning, 30
H
heterogeneous variables, 178-181
scaling dimensions, 180
hierarchical clustering, 33-38, 297
algorithm for, 35
closeness, 35
dendrogram, 34
individual clusters, 35
output listing, 37
Pearson correlation, 35
running, 37
hill climbing, 92-94
random-restart, 94
Holland, John, 100
Hollywood Stock Exchange, 5
home prices, modeling, 158-161
Zillow API, 159-161
Hot or Not, xvii, 161-164
hotornot.py
getpeopledata function, 162
getrandomratings function, 162
HTML documents, parser, 310
hyperbolic tangent (tanh) function, 78
I
inbound link searching, 85
inbound links, 69-73
PageRank algorithm, 70-73
simple count, 69
using link text, 73
independent component analysis, 6
independent features, 226-249
alternative display methods, 249
exercises, 248
K-means clustering, 248
news sources, 248
optimizing for factorization, 249
stopping criteria, 249
indexing, 54
adding to index, 61
building index, 58-62
finding words on page, 60
setting up schema, 59
tables, 59
intelligence, evolving, 250-276
inverse chi-square function, 130
inverse function, 172
IP addresses, 141
item-based bookmark filtering, 28
Item-based Collaborative Filtering Recommendation Algorithms, 27
item-based filtering, 22-25
getting recommendations, 24-25
item comparison dataset, 23-24
versus user-based filtering, 27
J
Jaccard coefficient, 14
K
Kayak, xvii, 116
API, 101, 106
data, 102
firstChild, 102
getElementsByTagName, 102
kayak.py, 102
createschedule function, 105
flightsearch function, 103
flightsearchresults function, 104
getkayaksession() function, 103
kernel
best kernel parameters, 225
kernel methods, 197-225
understanding, 211
kernel trick, 212-214, 290
radial-basis function, 213
kernels
other LIBSVM, 225
K-means clustering, 42-44, 248, 297-300
function for doing, 42
k-nearest neighbors (kNN), 169-172, 293-296
cross-validating, 294
defining similarity, 171
Euclidean distance, 293
number of neighbors, 169
scaling and superfluous variables, 294
strengths and weaknesses, 296
weighted average, 293
when to use, 195
L
Last.fm, 5
learning from clicks (see neural network, artificial)
LIBSVM
applications, 216
matchmaker dataset and, 218
other LIBSVM kernels, 225
sample session, 217
LIBSVM library, 291
line angle penalization, 116
linear classification, 202-205
dot-products, 203
vectors, 203
LinkedIn, 110
lists of interests, 206
local minima, 94, 305
longitudes and latitudes of two points into distance in miles, converting, 208
M
machine learning, 3
limits, 4
machine vision, 6
machine-learning algorithms (see algorithms)
Manhattan distance, 14, 53
marketing, 6
mass-and-spring algorithm, 111
matchmaker dataset, 197-219
categorical features, 205-209
creating new, 209
decision tree algorithm, 199-201
difficulties with data, 199
LIBSVM, applying to, 218
scaling data, 209-210
matchmaker.csv file, 198
mathematical formulas, 316-322
conditional probability, 319
dot-product, 322
entropy, 320
Euclidean distance, 316
Gaussian function, 321
Gini impurity, 319
Pearson correlation coefficient, 317
Tanimoto coefficient, 318
variance, 321
weighted mean, 318
matplotlib, 185, 313
installation, 313
usage example, 314
matrix math, 232-243
algorithm, 237
data matrix, 238
displaying results, 240, 246
factorize function, 238
factorizing, 234
multiplication, 232
multiplicative update rules, 238
NumPy, 236
preparing matrix, 245
transposing, 234
matrix, converting to, 230
maximum-margin hyperplane, 215
message boards, 117
minidom, 102
minidom API, 159
models, 3
MovieLens, using dataset, 25-27
multidimensional scaling, 49-52, 53, 300-302
code, 301
function, 50
Pearson correlation, 49
multilayer perceptron (MLP) network, 74, 285
multiplicative update rules, 238
mutation, 97, 251, 260-263
N
naïve Bayesian classifier, 123-127, 279
choosing category, 126
strengths and weaknesses, 280
versus Fisher method, 127
national security, 6
nested dictionary, 8
Netflix, 1, 5
network visualization
counting crossed lines, 112
drawing networks, 113
layout problem, 110-112
network visualization, 110-115
neural network, 55
artificial, 74-84
backpropagation, 80-82
connecting to search engine, 83
designing click-training network, 74
feeding forward, 78-80
setting up database, 75-77
training test, 83
neural network classifier, 141
neural networks, 285-288
backpropagation, and, 287
black box method, 288
combinations of words, and, 285
multilayer perceptron network, 285
strengths and weaknesses, 288
synapses, and, 285
training, 287
using code, 287
news sources, 227-230
newsfeatures.py, 227
getarticlewords function, 229
makematrix function, 230
separatewords function, 229
shape function, 237
showarticles function, 241, 242
showfeatures function, 240, 242
stripHTML function, 228
transpose function, 236
nn.py
searchnet class, 76
generatehiddennode function, 77
getstrength method, 76
setstrength method, 76
nnmf.py
difcost function, 237
non-negative matrix factorization (NMF), 232-239, 302-304
factorization, 30
goal of, 303
update rules, 303
using code, 304
normalization, 66
numerical predictions, 167
numpredict.py
createcostfunction function, 182
createhiddendataset function, 183
crossvalidate function, 177, 182
cumulativegraph function, 185
distance function, 171
dividedata function, 176
euclidian function, 171
gaussian function, 175
getdistances function, 171
inverseweight function, 173
knnestimate function, 171
probabilitygraph function, 187
probguess function, 184, 185
rescale function, 180
subtractweight function, 173
testalgorithm function, 177
weightedknn function, 175
wineprice function, 168
wineset1 function, 168
wineset2 function, 178
NumPy, 236, 312
installation on other platforms, 313
installation on Windows, 312
usage example, 313
using, 236
O
online technique, 296
Open Web APIs, xvi
optimization, 86-116, 181, 196, 304-306
annealing starting points, 116
cost function, 89-91, 304
exercises, 116
flight searches (see flight searches)
genetic algorithms, 97-100
crossover or breeding, 97
generation, 97
mutation, 97
population, 97
genetic optimization stopping criteria, 116
group travel cost function, 116
group travel planning, 87-88
car rental period, 89
cost function (see cost function)
departure time, 89
price, 89
time, 89
waiting time, 89
hill climbing, 92-94
line angle penalization, 116
network visualization
counting crossed lines, 112
drawing networks, 113
layout problem, 110-112
network visualization, 110-115
pairing students, 116
preferences, 106-110
cost function, 109
running, 109
student dorm, 106-108
random searching, 91-92
representing solutions, 88-89
round-trip pricing, 116
simulated annealing, 95-96
where it may not work, 100
optimization.py, 87, 182
annealingoptimize function, 95
geneticoptimize function, 98
elite, 99
maxiter, 99
mutprob, 99
popsize, 99
getminutes function, 88
hillclimb function, 93
printschedule function, 88
randomoptimize function, 91
schedulecost function, 90
P
PageRank algorithm, 5, 70-73
pairing students, 116
Pandora, 5
parse tree, 253
Pearson correlation
hierarchical clustering, 35
multidimensional scaling, 49
Pearson correlation coefficient, 11-14, 317
code, 317
Pilgrim, Mark, 309
polynomial transformation, 290
poplib, 140
population, 97, 250, 306
diversity and, 257
Porter Stemmer, 61
Pr(Document), 140
prediction markets, 5
price models, 167-196
building sample dataset, 167-169
eliminating variables, 196
exercises, 196
item types, 196
k-nearest neighbors (kNN), 169
laptop dataset, 196
leave-one-out cross-validation, 196
optimizing number of neighbors, 196
search attributes, 196
varying ss for graphing probability, 196
probabilities, 319
assumed probability, 122
Bayes' Theorem, 125
combining, 129
conditional probability, 122
graphing, 186
naïve Bayesian classifier (see naïve Bayesian classifier)
of entire document given classification, 124
product marketing, 6
public message boards, 117
pydelicious, 314
installation, 314
usage example, 314
pysqlite, 58, 311
importing, 132
installation on other platforms, 311
installation on Windows, 311
usage example, 312
Python
advantages of, xiv
tips, xv
Python Imaging Library (PIL), 38, 309
installation on other platforms, 310
usage example, 310
Windows installation, 310
Python, genetic programming and, 253-257
building and evaluating trees, 255-256
displaying program, 256
representing trees, 254-255
traversing complete tree, 253
Q
query layer, 74
querying, 63-64
query function, 63
R
radial-basis function, 212
random searching, 91-92
random-restart hill climbing, 94
ranking
content-based (see content-based ranking)
queries, 55
recommendation engines, 7-28
building del.icio.us link recommender, 19-22
building dataset, 20
del.icio.us API, 20
recommending neighbors and links, 22
collaborative filtering, 7
collecting preferences, 8-9
nested dictionary, 8
exercises, 28
finding similar users, 9-15
Euclidean distance score, 10-11
Pearson correlation coefficient, 11-14
ranking critics, 14
which metric to use, 14
item-based filtering, 22-25
getting recommendations, 24-25
item comparison dataset, 23-24
item-based filtering versus user-based filtering, 27
matching products, 17-18
recommending items, 15-17
weighted scores, 15
using MovieLens dataset, 25-27
recommendations based on purchase history, 5
recommendations.py, 8
calculateSimilarItems function, 23
getRecommendations function, 16
getRecommendedItems function, 25
loadMovieLens function, 26
sim_distance function, 11
sim_pearson function, 13
topMatches function, 14
transformPrefs function, 18
recursive tree binding, 149-151
returning ranked list of documents from query, 55
RoboCup, 252
round-robin tournament, 270
round-trip pricing, 116
RSS feeds
counting words in, 31-33
filtering, 134-136
parsing, 309
rule-based filters, 118
S
scaling and superfluous variables, 294
scaling data, 209-210
scaling dimensions, 180
scaling, optimizing, 181-182
scoring metrics, 69-73
PageRank algorithm, 70-73
simple count, 69
using link text, 73
search engines
Boolean operations, 84
content-based ranking (see content-based ranking)
crawler (see crawler)
document search, long/short, 84
exact matches, 84
exercises, 84
inbound link searching, 85
indexing (see indexing)
overview, 54
querying (see querying)
scoring metrics (see scoring metrics)
vertical, 101
word frequency
bias, 84
word separation, 84
searchengine.py
addtoindex function, 61
crawler class, 55, 57, 59
createindextables function, 59
distancescore function, 68
frequencyscore function, 66
getentryid function, 61
getmatchrows function, 63
gettextonly function, 60
import statements, 57
importing neural network, 83
inboundlinkscore function, 69
isindexed function, 58, 62
linktextscore function, 73
normalization function, 66
searcher class, 65
nnscore function, 84
query method, 83
searchnet class
backPropagate function, 81
trainquery method, 82
updatedatabase method, 82
separatewords function, 60
searchindex.db, 60, 62
searching, random, 91-92
self-organizing maps, 30
sigmoid function, 78
signups, predicting, 142-144
simulated annealing, 95-96, 305
socialnetwork.py, 111
crosscount function, 112
drawnetwork function, 113
spam filtering, 117
method, 4
threshold, 126
tips, 126
SpamBayes plug-in, 127
spidering, 56 (see crawler)
SQLite, 58
embedded database interface, 311
persisting trained classifiers, 132-133
tables, 59
squaring numbers, 177
stemming algorithm, 61
stochastic optimization, 86
stock market analysis, 6
stock market data, 243-248
closing price, 243
displaying results, 246
Google's trading volume, 248
preparing matrix, 245
running NMF, 246
trading volume, 243
Yahoo! Finance, 244
stockfeatures.txt file, 247
stockvolume.py, 245, 246
factorize function, 246
student dorm preference, 106-108
subtraction function, 173
supervised classifiers, 226
supervised learning methods, 29, 277-296
supply chain optimization, 6
support vectors, 216
support-vector machines (SVMs), 197-225, 289-292
Bayesian classifier, 225
building model, 224
dot-products, 290
exercises, 225
hierarchy of interests, 225
kernel trick, 290
LIBSVM, 291
optimizing dividing line, 225
other LIBSVM kernels, 225
polynomial transformation, 290
strengths and weaknesses, 292
synapses, 285
T
tagging similarity, 28
Tanimoto coefficient, 47, 318
code, 319
Tanimoto similarity score, 28
temperature, 306
test sets, 176
third-party libraries, 309-315
Beautiful Soup, 310
matplotlib, 313
installation, 313
usage example, 314
NumPy, 312
installation on other platforms, 313
installation on Windows, 312
usage example, 313
pydelicious, 314
installation, 314
usage example, 314
pysqlite, 311
installation on other platforms, 311
installation on Windows, 311
usage example, 312
Python Imaging Library (PIL), 309
installation on other platforms, 310
usage example, 310
Windows installation, 310
Universal Feed Parser, 309
trading behavior, 5
trading volume, 243
training
Bayesian classifier, 278
decision tree classifier, 281
neural networks, 287
sets, 176
transposing, 234
tree binding, recursive, 149-151
treepredict.py, 144
buildtree function, 149
classify function, 153
decisionnode class, 144
divideset function, 145
drawnode function, 153
drawtree function, 152
entropy function, 148
mdclassify function, 157
printtree function, 151
prune function, 155
split_function, 146
uniquecounts function, 147
variance function, 158
trees (see decision trees)
U
uneven distributions, 183-188
graphing probabilities, 185
probability density, estimating, 184
Universal Feed Parser, 31, 134, 309
unsupervised learning, 30
unsupervised learning techniques, 296-302
unsupervised techniques, 226
update rules, 303
urllib2, 56, 102
Usenet, 117
user-based collaborative filtering, 23
user-based efficiency, 28
user-based filtering
versus item-based filtering, 27
V
variance, 321
code, 321
varying assumed probabilities, 140
vector angles, calculating, 322
vectors, 203
vertical search engine, 101
virtual features, 141
W
weighted average, 175, 293
weighted mean, 318
code, 318
weighted neighbors, 172-176
bell curve, 174
Gaussian function, 174
inverse function, 172
subtraction function, 173
weighted kNN, 175
weighted scores, 15
weights matrix, 235
Wikipedia, 2, 56
word distance, 65, 68
word frequency, 64, 66
bias, 84
word separation, 84
word usage patterns, 226
word vectors, 30-33
clustering blogs based on word frequencies, 30
counting words in feed, 31-33
wordlocation table, 63, 64
words commonly used together, 40
X
XML documents, parser, 310
xml.dom, 102
Y
Yahoo! application key, 207
Yahoo! Finance, 53, 244
Yahoo! Groups, 117
Yahoo! Maps, 207
yes/no questions, 206
Z
Zebo, 44
scraping results, 45
web site, 45
Zillow API, 159-161
zillow.py
getaddressdata function, 159
getpricelist function, 160
Back to Programming Collective Intelligence
