WarLogs Dataset

The dataset contains a subset of the reports concerning the Iraq war, from 2004 to 2009, published by WikiLeaks on October 22, 2010.

The dataset, already cleaned and preprocessed, consists of a single relational table with the following attributes:

report_key | text: ID of report
to_timestamp | timestamp: date of release of report (up to the minute)
Type | text: Macro-classification of events in each report
category | text: Specific classification of each report
region | text: Class of location of the event
attack_on | text: target of event/attack of the report
coalition_forces_wounded | integer: n. coalition force units wounded in the event/attack
coalition_forces_killed | integer: n. coalition force units killed in the event/attack
iraq_forces_wounded | integer: n. Iraqi force units wounded in the event/attack
iraq_forces_killed | integer: n. Iraqi force units killed in the event/attack
civilian_wia | integer: n. civilians wounded in the event/attack
civilian_kia | integer: n. civilians killed in the event/attack
enemy_wia | integer: n. enemy units wounded in the event/attack
enemy_kia | integer: n. enemy units killed in the event/attack
enemy_detained | integer: n. enemy units captured in the event/attack
total_deaths | integer: total number of deaths in the event/attack
st_x | numeric: longitude of event/attack location
st_y | numeric: latitude of event/attack location
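
As a minimal sketch of how a table with these attributes could be read outside Weka, the following Python snippet parses CSV text with the standard library and converts the count and coordinate columns to numbers. It assumes a header row with the attribute names listed above and a comma as separator; neither is confirmed here. The function name `parse_warlogs`, the timestamp format and the sample record are invented for illustration.

```python
import csv
import io

# Column names copied from the attribute list; their exact spelling in the
# real file and the presence of a header row are assumptions.
TEXT_COLUMNS = ["report_key", "to_timestamp", "Type", "category", "region", "attack_on"]
INT_COLUMNS = [
    "coalition_forces_wounded", "coalition_forces_killed",
    "iraq_forces_wounded", "iraq_forces_killed",
    "civilian_wia", "civilian_kia",
    "enemy_wia", "enemy_kia", "enemy_detained", "total_deaths",
]
FLOAT_COLUMNS = ["st_x", "st_y"]


def parse_warlogs(text):
    """Parse CSV text into a list of dicts, converting numeric columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for name in INT_COLUMNS:
            # Empty cells become None instead of being silently read as 0.
            row[name] = int(row[name]) if row[name] else None
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name]) if row[name] else None
        rows.append(row)
    return rows


# One invented record, only to show the expected shape of the input.
header = ",".join(TEXT_COLUMNS + INT_COLUMNS + FLOAT_COLUMNS)
record = "R-0001,2006-01-01 10:30,TYPE_A,CAT_A,REGION_A,TARGET_A,0,0,1,0,2,1,0,0,0,1,44.4,33.3"
rows = parse_warlogs(header + "\n" + record + "\n")
print(rows[0]["civilian_kia"], rows[0]["st_x"], rows[0]["st_y"])
```

For the real data, the archive would first have to be unzipped and the file content passed to the function as text.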

The dataset is in CSV format: warlogs.csv.zip
Here is also a small sample of data (2000 reports): warlogs2000.csv.zip

Problem

The exercise requires performing two clusterings of the dataset:

group events based on the impact on the population and on the forces involved (casualties, captured or wounded units, etc.)
group events based on location, in order to discover geographical areas where events are denser. Optionally, the temporal dimension can be included in the process (e.g. to split the dataset, or as an additional attribute in the clustering)
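
The exercise itself is meant to be solved with the clusterers provided by Weka. Purely to illustrate what a partitional clustering of the casualty counts does, here is a small k-means sketch in plain Python. It is not Weka's implementation: the function name `kmeans`, the deterministic choice of the first k distinct points as initial centroids, and the sample vectors (wounded, killed, detained) are assumptions made for the example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd-style k-means on tuples; returns (centroids, labels)."""
    # Deterministic start: the first k distinct points are the initial centroids.
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Invented (wounded, killed, detained) vectors: three minor and three severe events.
events = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 8, 0), (12, 9, 1), (11, 10, 0)]
centroids, labels = kmeans(events, k=2)
print(labels)
```

The same function could be applied to (st_x, st_y) pairs for the location-based grouping, although a density-based method may suit the search for dense areas better.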

The content of each cluster must then be analyzed. The analysis of the provided data should use the clustering methods available in Weka, relating the different types of attack (attribute “Type”) to the distributions of deaths, wounded and captured enemies.
For each clustering performed, provide a motivation/explanation of the result based on the values of the other attributes (e.g. the correlation between different attributes and the cluster assignment).
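
One simple way to relate a clustering to another attribute is a contingency table counting how often each attribute value occurs in each cluster. The sketch below assumes the cluster labels and the “Type” values are available as two parallel lists; the helper name `crosstab` and the sample values are invented for illustration.

```python
from collections import Counter, defaultdict


def crosstab(cluster_labels, attribute_values):
    """Count how often each attribute value occurs inside each cluster."""
    table = defaultdict(Counter)
    for cluster, value in zip(cluster_labels, attribute_values):
        table[cluster][value] += 1
    return table


# Invented cluster labels and "Type" values for six reports.
cluster_labels = [0, 0, 0, 1, 1, 1]
types = ["TYPE_A", "TYPE_A", "TYPE_B", "TYPE_B", "TYPE_B", "TYPE_B"]

table = crosstab(cluster_labels, types)
for cluster in sorted(table):
    print(cluster, dict(table[cluster]))
```

A cluster dominated by a single value suggests that the clustering attributes and that attribute are related; an even spread suggests they are not.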

Hints:

The terms of some attributes may contain errors/repetitions (which attributes are affected is not revealed for now). Check whether these errors emerge within the clustering
Some attributes play the role of a class (for example, the attribute “region”). Check that the correlation with the attributes “st_x” and “st_y” is valid.
Try selecting subsets of attributes to run separate clusterings
Once a clustering of the dataset has been determined, is it possible to select one of the clusters as a separate dataset, in order to run further analyses on the chosen group only? (this requires filters external to Weka)
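
Some of these checks can be done outside Weka. The sketch below shows three small helpers: one flags terms whose spelling is suspiciously close, one computes the coordinate bounding box of each region, and one extracts a single cluster as a separate list of rows. It assumes the clustered data has been exported with one extra column holding the cluster label, here called `cluster`; that column name, the helper names, the similarity cutoff and all sample values are hypothetical.

```python
import difflib


def similar_terms(values, cutoff=0.85):
    """Return pairs of distinct terms whose spelling is suspiciously close."""
    terms = sorted(set(values))
    pairs = []
    for i, term in enumerate(terms):
        # Compare only against later terms so each pair is reported once.
        for other in difflib.get_close_matches(term, terms[i + 1:], n=5, cutoff=cutoff):
            pairs.append((term, other))
    return pairs


def region_bounds(rows):
    """Bounding box [min_x, max_x, min_y, max_y] of the coordinates per region."""
    bounds = {}
    for row in rows:
        x, y = row["st_x"], row["st_y"]
        if row["region"] not in bounds:
            bounds[row["region"]] = [x, x, y, y]
        else:
            b = bounds[row["region"]]
            b[0], b[1] = min(b[0], x), max(b[1], x)
            b[2], b[3] = min(b[2], y), max(b[3], y)
    return bounds


def select_cluster(rows, cluster_column, wanted):
    """Keep only the rows assigned to the wanted cluster."""
    return [row for row in rows if row[cluster_column] == wanted]


# Invented rows; "cluster" is the hypothetical column added after clustering.
ROWS = [
    {"region": "REGION_A", "category": "Ambush", "st_x": 44.0, "st_y": 33.0, "cluster": "c0"},
    {"region": "REGION_A", "category": "Ambussh", "st_x": 44.5, "st_y": 33.4, "cluster": "c1"},
    {"region": "REGION_B", "category": "Patrol", "st_x": 47.8, "st_y": 30.5, "cluster": "c0"},
]

suspects = similar_terms(row["category"] for row in ROWS)
bounds = region_bounds(ROWS)
subset = select_cluster(ROWS, "cluster", "c0")
print(suspects, bounds, len(subset))
```

Strongly overlapping or unexpectedly large bounding boxes would suggest that “region” and the coordinates do not agree for some reports.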