dm:start:clustering
Differences
These are the differences between the selected revision and the current version of the page.
dm:start:clustering [18/12/2012 at 14:09 (12 years ago)] – created Fosca Giannotti | dm:start:clustering [18/12/2012 at 14:20 (12 years ago)] (current version) – [Guidelines for the homework on clustering] Fosca Giannotti | ||
---|---|---|---|
Line 5: | Line 5: | ||
* **Data Understanding: | * **Data Understanding: | ||
- | * Distribution | + | * Distribution analysis and suitable transformation of variables |
* Elimination of redundant variables by correlation analysis | * Elimination of redundant variables by correlation analysis | ||
* **Clustering Analysis by K-means: (15 points)** | * **Clustering Analysis by K-means: (15 points)** | ||
* Identification of the best value of k | * Identification of the best value of k | ||
- | * Characterization of the obtained clusters by using both analysis of the k centroids and comparison of the distribution of variables within the clusters and in the whole dataset | + | * Characterization of the obtained clusters by using both analysis of the k centroids and comparison of the distribution of variables within the clusters and that in the whole dataset |
* **Analysis by density-based clustering (7 points)** | * **Analysis by density-based clustering (7 points)** | ||
Line 18: | Line 18: | ||
* **Analysis by hierarchical clustering (Optional - 3 points)** | * **Analysis by hierarchical clustering (Optional - 3 points)** | ||
* Analysis to be performed on a sampling of the data for scalability reasons | * Analysis to be performed on a sampling of the data for scalability reasons | ||
+ | |||
+ | |||
+ | ====== Description of the variables ====== | ||
+ | |||
+ | For each car driver we observe the following quantities, measured over a certain time window of mobile activity: | ||
+ | |||
+ | Length = total traveled distance (m.) | ||
+ | Duration = total time spent driving (sec.) | ||
+ | Count = number of different trips | ||
+ | Phighway = distance traveled on highways (m.) | ||
+ | Pcity = distance traveled inside cities (m.) | ||
+ | Length_arc_crowded = distance traveled on the 20% most crowded roads (m.) | ||
+ | Pnight = distance traveled at night time (m.) | ||
+ | Pover = distance traveled over speed limit (m.) | ||
+ | Profile = number of systematic trips, e.g., work-home | ||
+ | Radius_g = radius of gyration: sparsity of location from the center of mass of the driver (mean position) | ||
+ | Radius_g_L1 = radius of gyration w.r.t. L1: sparsity of location from the driver' | ||
+ | Avg_Dist_L1 = average distance from L1: average distance from the driver' | ||
+ | TimeL1L2 = % time spent at locations L1 and L2 (most and second most preferred locations) | ||
+ | EntropyArc = entropy on road segment frequencies, | ||
+ | EntropyLocation = entropy on location frequencies, | ||
+ | EntropyTime = entropy on hours of the day, measures the diversity of daily patterns | ||
+ | |||
+ | Notice that there are no missing values in the dataset, hence " | ||
+ | |||
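The entropy and radius-of-gyration variables described above can be illustrated with a minimal Python sketch. The page does not give the exact formulas used to build the dataset, so everything here is an assumption: Shannon entropy in bits over empirical frequencies, and radius of gyration as the root-mean-square distance of unweighted planar points (in metres) from a centre. The function names `shannon_entropy` and `radius_of_gyration` and the sample data are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(observations):
    """Shannon entropy (bits) of the empirical frequencies of the observations.

    Assumption: this mirrors variables such as EntropyTime, where the
    observations would be the hours of the day at which trips occur.
    """
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # sum of p * log2(1/p) over the observed categories
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def radius_of_gyration(points, center=None):
    """Root-mean-square distance of planar (x, y) points from a centre.

    Assumption: with center=None the centre of mass (mean position) is used,
    as in Radius_g; passing the coordinates of a preferred location instead
    gives a quantity in the spirit of Radius_g_L1.
    """
    if not points:
        raise ValueError("at least one point is required")
    if center is None:
        cx = sum(x for x, _ in points) / len(points)
        cy = sum(y for _, y in points) / len(points)
    else:
        cx, cy = center
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return math.sqrt(squared / len(points))


# Hypothetical driver: trip start hours and visited locations (metres).
trip_hours = [8, 8, 18, 18, 8, 18, 13]
locations = [(0.0, 0.0), (1200.0, 300.0), (0.0, 0.0), (1200.0, 300.0)]

print(shannon_entropy(trip_hours))
print(radius_of_gyration(locations))
print(radius_of_gyration(locations, center=(0.0, 0.0)))
```

A driver whose trips always fall in the same hour gets entropy 0, while a driver whose trips are spread evenly over many hours gets a high value. This matches the stated reading of EntropyTime as a measure of the diversity of daily patterns.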
dm/start/clustering.1355839756.txt.gz · Last modified: 18/12/2012 at 14:09 (12 years ago) (external edit)