
dm:start:clustering

Differences

These are the differences between the selected revision and the current version of the page.

dm:start:clustering [18/12/2012 at 14:09 (12 years ago)] – created, Fosca Giannotti
dm:start:clustering [18/12/2012 at 14:20 (12 years ago)] (current version) – [Guidelines for the homework on clustering] Fosca Giannotti
Line 5:
  
   * **Data Understanding: useful as a preliminary step to capture some data property that can help the clustering analysis (8 points)**
-       * Distribution data analysis and suitable transformation of variables
+       * Distribution analysis and suitable transformation of variables
        * Elimination of redundant variables by correlation analysis
  
   * **Clustering Analysis by K-means: (15 points)**
        * Identification of the best value of k
-       * Characterization of the obtained clusters by using both analysis of the k centroids and comparison of the distribution of variables within the clusters and in the whole dataset
+       * Characterization of the obtained clusters by using both analysis of the k centroids and comparison of the distribution of variables within the clusters and that in the whole dataset
  
   * **Analysis by density-based clustering (7 points)**
Line 18:
   * **Analysis by hierarchical clustering (Optional - 3 points)**
     * Analysis to be performed on a sampling of the data for scalability reasons
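
As an illustration of the first two guideline items (removing redundant variables by correlation analysis, and choosing k for K-means), here is a minimal pure-Python sketch. It is not part of the homework text. The function names, the 0.9 correlation threshold, and the use of the sum of squared errors (SSE) as the criterion for comparing values of k are all assumptions made for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # A constant column has no defined correlation; treat it as 0 here.
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def drop_redundant(columns, threshold=0.9):
    """Keep a variable only if |r| <= threshold with every variable already kept.

    `columns` maps a variable name to its list of values. Iteration order
    decides which member of a highly correlated pair survives.
    The threshold of 0.9 is an arbitrary choice for illustration.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def kmeans_sse(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of tuples; returns the final SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
```

One way to use it: compute `kmeans_sse(points, k)` for a range of k and look for the value after which the SSE stops dropping sharply. Variables should normally be put on comparable scales first, since distances in meters and entropies differ by orders of magnitude.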
 +
 +
 +====== Description of the variables ======
 +
 +For each car driver we observe the following quantities, measured over a certain time window of mobile activity:  
 +
 +  Length = total traveled distance (m.)
 +  Duration = total time spent driving (sec.)
 +  Count = number of different trips
 +  Phighway = distance traveled on highways (m.)
 +  Pcity = distance traveled inside cities (m.)
 +  Length_arc_crowded = distance traveled on the 20% most crowded roads (m.)
 +  Pnight = distance traveled at night time (m.)
 +  Pover = distance traveled over speed limit (m.)
 +  Profile = number of systematic trips, e.g., work-home
 +  Radius_g = radius of gyration: spread of the driver's locations around their center of mass (mean position)
 +  Radius_g_L1 = radius of gyration w.r.t. L1: spread of the driver's locations around the driver's most frequent location (e.g., home)
 +  Avg_Dist_L1 = average distance from L1: average distance from the driver's most frequent location (e.g., home)
 +  TimeL1L2 = % of time spent at locations L1 and L2 (most and second most preferred locations)
 +  EntropyArc = entropy on road segment frequencies, measures the diversity of roads traveled
 +  EntropyLocation = entropy on location frequencies, measures the diversity of places visited
 +  EntropyTime = entropy on hours of the day, measures the diversity of daily patterns
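
To make the entropy and radius-of-gyration variables more concrete, here is a small sketch of how such quantities are commonly computed. The page does not give the exact formulas used to build the dataset, so the logarithm base (2), the planar coordinates, and the function names below are assumptions for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(observations):
    """Shannon entropy (in bits) of the empirical frequencies of `observations`.

    For example, `observations` could be the road segments, locations, or
    hours of the day recorded for one driver.
    """
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def radius_of_gyration(positions, center=None):
    """Root-mean-square distance of `positions` from `center`.

    With center=None the center of mass (mean position) is used; passing the
    most frequent location instead gives a variant relative to that location.
    Positions are assumed to be planar coordinates, e.g. in meters.
    """
    if center is None:
        center = tuple(sum(v) / len(positions) for v in zip(*positions))
    squared = sum(math.dist(p, center) ** 2 for p in positions)
    return math.sqrt(squared / len(positions))
```

A driver who always uses the same roads gets an entropy of 0, while one who spreads trips evenly over many roads gets a high value; similarly, a small radius of gyration indicates locations concentrated around the chosen center.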
 +
 +Note that there are no missing values in the dataset: every "0" is an actual zero, NOT a missing value.
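
Because zeros are real measurements, a transformation applied to a skewed variable has to be defined at 0. One common option, shown here only as a sketch and not prescribed by the guidelines, is log(1 + x); the function name is hypothetical.

```python
import math


def log1p_transform(values):
    """Apply log(1 + x) to non-negative values; actual zeros map to 0."""
    if any(v < 0 for v in values):
        raise ValueError("log1p_transform expects non-negative values")
    return [math.log1p(v) for v in values]
```

For instance, a driver with `Phighway = 0` keeps the value 0 after the transformation rather than being dropped or treated as missing.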
 +
        
dm/start/clustering.1355839756.txt.gz · Last modified: 18/12/2012 at 14:09 (12 years ago) (external edit)
