
/***************
 * Tethering detection
***************/


/***************
 * Features
***************/
1. IP layer
   a) TTL
   b) ID field
   c) Inter-arrival time

2. TCP layer
   a) # of connections (# of <src IP, src port, dst IP, dst port>)
   b) round-trip time
   c) TCP WIN

3. UDP layer
   a) # of connections

4. Application layer
   a) User-Agent


/***************
 * Directories
***************/
1. dataset_narus, dataset_sprint, and dataset_uscc are links to the datasets

2. 172.31.7.162:/data/ychen
    stores the copy of the dataset on local disk.

3. plot_summary_ttl
    plot the results generated by tool1.pl
    /export/home/ychen/sprint/output/detect.TTL.summary

4. tethered_clients
    IPs of the tethered clients detected by each heuristic

5. output_boot_time_freq
    used to store the frequency calculated by "detect_tethering_boot_time.pl"


/***************
 * Sub-Tasks
***************/
1. subtask_initial_ttl
    Some time ago, the initial TTL of Windows was 64.
    This is just a quick check whether any client has an initial TTL of 64.

    - input: ../output/file.$file_id.ttl.txt

    - output:
        ./output
        <# client> <# normal client w/ TTL_X> <...> <# tethered client w/ TTL_X> <...>

2. subtask_ttl_distribution
   a) Check whether the TTLs behind a mobile station differ only by 1.
    b) Check the distribution of TTL values.

    some useful info:
    - http://forums.macrumors.com/showthread.php?t=1140306
    - http://www.map.meteoswiss.ch/map-doc/ftp-probleme.htm

3. subtask_nontether_id
    Generate the IP ID field timeseries of non-tethered users.
    We want to see whether there is a difference between the IDs of tethered and non-tethered users.

4. subtask_tcp_seq_ack
    Group TCP packets by <src IP, src port, dst IP, dst port> and manually check the sequence/ACK numbers.

5. subtask_http_agents
    List all the User-Agent strings from the HTTP headers of each user.

6. subtask_clock_skew
    Calculate the clock skew using TCP Timestamp.

    a) calculate_clock_skew.pl
       I followed the method in the paper "Remote physical device fingerprinting" page 5:

        t_i: the receiving time of the packet i
        T_i: the TCP Timestamp of the packet i
        Hz: the clock frequency (so T_i / Hz ~ the tx time)

        Let x_i = t_i - t_1, which is the interval between the pkt i and the pkt 1 observed by the receiver.
        Let w_i = (T_i - T_1) / Hz, which is the interval between the pkt i and the pkt 1 from the sender.

        And let y_i = w_i - x_i, which is the observed offset of the packet i. 
        So, the slope of the offset-set (x_i, y_i) is the skew of the clock.
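
        For reference, a minimal sketch of this slope computation (my own illustration, not one of the scripts above): it assumes one "<rx time (sec)> <TCP Timestamp>" line per packet of a single flow, assumes the clock frequency Hz is known, and uses a plain least-squares fit for the slope (the paper itself fits the slope with linear programming).

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $hz = 100;                      # assumed sender clock frequency (Hz)
        my ($t1, $T1);
        my (@xs, @ys);
        while (<STDIN>) {
            my ($t, $T) = split;           # rx time t_i, TCP Timestamp T_i
            next unless defined $T;
            ($t1, $T1) = ($t, $T) unless defined $t1;
            my $x = $t - $t1;              # x_i: interval observed by the receiver
            my $w = ($T - $T1) / $hz;      # w_i: interval reported by the sender
            push(@xs, $x);
            push(@ys, $w - $x);            # y_i: observed offset
        }
        die "need >= 2 packets\n" if @xs < 2;

        # least-squares slope of (x_i, y_i) ~ clock skew
        my $n = scalar @xs;
        my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
        for my $i (0 .. $n - 1) {
            $sx  += $xs[$i];
            $sy  += $ys[$i];
            $sxx += $xs[$i] * $xs[$i];
            $sxy += $xs[$i] * $ys[$i];
        }
        my $skew = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx * $sx);
        printf("estimated clock skew = %.9f\n", $skew);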


    b) calculate_clock_skew_remove_rtt.pl
       Based on the above code, I try to remove the one-way delay estimated by RTT from the TCP Timestamp -- but it doesn't yield any improvement.

    c) calculate_clock_skew_sue.pl 
       Based on "calculate_clock_skew.pl", use Moon Sue's alg to calculate clock skew.

       ClockSkewMoon.pm
       My implementation of Sue Moon's clock skew estimation algorithm proposed in "Estimation and Removal of Clock Skew from Network Delay Measurements".

    d) calculate_clock_skew_remove_delay_intermediate_sue.pl
       Based on "calculate_clock_skew_sue.pl" and take two trace files as input, one from destination node and one from an intermediate node, and try to remove one way dalay by combining two traces.

    e) calculate_clock_skew_sue_segment.pl
       Modified from "calculate_clock_skew_sue.pl". Takes its output and calculates the clock skew of smaller segments.

       - batch run:
         i) batch_40_machines_avg_std_skew.sh
         ii) batch_40_machines_seg_size.sh


7. subtask_boot_time
    Calculate the boot time using TCP Timestamp option.

    a) group_by_tcp_timestamp.pl
       Read the processed pcap file (which includes the TCP Timestamp) and calculate the Timestamp at some reference time t.
       It also outputs the calculated Timestamp of each IP.

    b) group_by_tcp_timestamp.m
       Read the output of "group_by_tcp_timestamp.pl" and use K-Means to cluster the calculated Timestamps.
       Then it uses the "partition index" to determine which number of partitions yields the best clustering.

    c) group_by_tcp_timestamp.java
       Read the output of "group_by_tcp_timestamp.pl" and use "X-Means" and "DBSCAN" to see how many clusters there are.
       The clustering methods are implemented by Weka.

    d) group_by_tcp_timestamp_elki.java
       Read the output of "group_by_tcp_timestamp.pl" and use "DBSCAN" to see how many clusters there are.
       The clustering method is implemented by ELKI.

    e) group_by_tcp_timestamp_elki.sh
       Read the output of "group_by_tcp_timestamp.pl" and use "DBSCAN" to see how many clusters there are.
       This script uses ELKI's command-line mode.

    f) group_by_tcp_timestamp_boottime.pl
       Based on "group_by_tcp_timestamp.pl", but instead of calculate the initial TCP Timestamp, this one calculate the boot time.

    g) - MyBootTime.pm
       - estimate_boot_time.pl
       - estimate_freqs.pl
       The code tries to implement recursive functions to better estimate the boot time and frequency -- but they need some modification to work.
       The basic idea is to iterate over all possible frequencies and use each one to calculate the boot time from every packet. If the packets are from the same device, the differences between the calculated boot times should be small (a minimal sketch of this idea appears at the end of this sub-task).

       MyBootTime.pm is called by "estimate_boot_time.pl" and "estimate_freqs.pl".

    h) - analyze_freq.pl
        Analyze how the frequency changes over packets.
        a) latest frequency
        b) avg frequency
        c) avg frequency in a time window
        d) EWMA
       - analyze_freq_per_flow.pl
        Same as above but just do that per flow.

       batch run: analyze_freq.sh
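
       A minimal sketch of the frequency-search idea from g) (my own illustration, not one of the scripts above; the input format "<rx time (sec)> <TCP Timestamp>" per packet of one IP and the candidate frequency list are assumptions):

        #!/usr/bin/perl
        use strict;
        use warnings;

        sub sum { my $s = 0; $s += $_ for @_; return $s; }

        my @pkts;                                 # [rx time, TCP Timestamp] per packet
        while (<STDIN>) {
            my ($t, $T) = split;
            push(@pkts, [$t, $T]) if defined $T;
        }
        die "no packets\n" unless @pkts;

        # For each candidate frequency, estimate boot time = rx_time - Timestamp/freq
        # for every packet; the "right" frequency gives the most consistent estimates.
        my @candidate_freqs = (10, 100, 128, 250, 1000);   # assumed candidate Hz values
        my ($best_freq, $best_boot, $best_std);
        foreach my $hz (@candidate_freqs) {
            my @boots = map { $_->[0] - $_->[1] / $hz } @pkts;
            my $mean  = sum(@boots) / scalar(@boots);
            my $std   = sqrt(sum(map { ($_ - $mean) ** 2 } @boots) / scalar(@boots));
            if (!defined($best_std) || $std < $best_std) {
                ($best_freq, $best_boot, $best_std) = ($hz, $mean, $std);
            }
        }
        printf("freq = %d Hz, estimated boot time = %.3f (stdev %.3f s)\n",
               $best_freq, $best_boot, $best_std);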


8. subtask_boot_time.stable
    time_to_stable_boottime_per_flow.pl
        Analyze the following things:
        a) How long the frequency per flow takes to become stable.
        b) How long are the flows (in seconds)
        c) How many packets per flow
        d) How much traffic per flow
        e) How many flows per IP


9. subtask_boot_time.deep_with_user_agent
    - get_freq_of_ip_labeled_by_user_agent.pl
        Combine the TTL, boot time, and clock frequency heuristics to detect tethering usage. The result is compared with that of the User-Agent heuristic.

    - calculate_flow_statistics.pl
        Analyze the following things:
        a) the ratio of flows/IPs with User Agent
        b) the ratio of flows/IPs with recognizable User Agent (i.e. User-Agent with keywords like "windows", "android", "mac os", ...)
        c) the ratio of flows/IPs with Timestamp
        d) the ratio of flows/IPs with Timestamps that are long enough to estimate an accurate boot time

    - calculate_boot_time_intervals.pl
        Get the boot time intervals from flows with different User Agents of an IP.

    - sanity_check_unstable_freq.sh
        Use tshark to output the RTT, TCP Timestamp, and packet receiving time to manually calculate the frequency.
        This is used to check that the frequency is indeed unstable and whether it is related to RTT.


10. subtask_compare
    This task is to compare the result of any heuristic with a more reliable one (e.g. User Agent) and evaluate the parameters/correctness of the heuristic.


11. subtask_tcp_flavor
    This task is to identify the TCP flavor by congestion window size

12. subtask_tcp_timestamp
    - timestamp_statistics.pl
        The code is used to get the following TCP Timestamp statistics
        a) # flows w/ and w/o TS
        b) # IPs w/ and w/o TS
        c) # flows w/ TS of various OS
        d) # IPs w/ TS of various OS
        e) # flows w/o TS of various OS
        f) # IPs w/o TS of various OS

13. subtask_iphone_freq
    The task is to figure out why the clock frequency changes over time and, if possible, find the pattern.
    - analyze_freq_per_flow[| 2 | 3 | 4].pl
        Each script uses a different way to calculate the per-flow frequency and evaluates it using the standard deviation of the estimated boot time.


/***************
 * Sprint tools
***************/
1. pcap parsers
    a) pcapParser.c
        only IP layer info
        - input: pcap_file
        - output: 
            print format
            <time> <time usec> <src ip> <dest ip> <proto> <ttl> <id> <length>

        - batch_pcapParser.sh
            - run pcapParser.c in batch
            - input files: /data/ychen/sprint/pcap
            - output files: /data/ychen/sprint/text

        note.
            ignore IP-ENCAP, IP fragmentation


    b) pcapParser2.c
        IP and TCP info
        - input: pcap_file
        - output: 
            print format
            <time> <time usec> <src ip> <dest ip> <proto> <ttl> <id> <length> <src port> <dst port> <seq> <ack seq> <flag fin> <flag syn> <flag rst> <flag push> <flag ack> <flag urg> <flag ece> <flag cwr> <win> <urp> <payload len>

        - batch_pcapParser2.sh
            - run pcapParser2.c in batch
            - input files: /data/ychen/sprint/pcap
            - output files: /data/ychen/sprint/text2

        note.
            ignore IP-ENCAP, IP fragmentation


    c) pcapParser3.c
        List all the User-Agent strings from the HTTP headers of each user
        - input: pcap_file
        - output: 
            format
            <time> <time usec> <src ip> <dest ip> <proto> <ttl> <id> <length> <src port> <dst port> <seq> <ack seq> <flag fin> <flag syn> <flag rst> <flag push> <flag ack> <flag urg> <flag ece> <flag cwr> <win> <urp> <payload len>
            <http header>
            <new line>


        - batch_pcapParser3.sh
            - run pcapParser3.c in batch
            - input files: /data/ychen/sprint/pcap
            - output files: /data/ychen/sprint/text3


    d) pcapParser4.c
        reads in a pcap file and outputs "UDP" info
        - input: pcap_file
        - output: 
            format
            <time> <src ip> <dest ip> <proto> <ttl> <id> <length> <src port> <dst port> <length>


        - batch_pcapParser4.sh
            - run pcapParser4.c in batch
            - input files: /data/ychen/sprint/pcap
            - output files: /data/ychen/sprint/text4


    e) pcapParser5.c
        reads in a pcap file and finds whether the TCP Timestamp option is present
        - input: pcap_file
        - output: 
            format
            <time> <src ip> <dest ip> <proto> <ttl> <id> <length> <src port> <dst port> <seq> <ack seq> <flag fin> <flag syn> <flag rst> <flag push> <flag ack> <flag urg> <flag ece> <flag cwr> <win> <urp> <payload len> <timestamp> <timestamp reply>


        - batch_pcapParser5.sh
            - run pcapParser5.c in batch
            - input files: /data/ychen/sprint/pcap
            - output files: /data/ychen/sprint/text5

    f) pcapParser6.c
        reads in a pcap file and finds the TCP Window Scale option
        - input: pcap_file
        - output: 
            format
            <time> <time usec> <src ip> <dest ip> <proto> <ttl> <id> <length> <src port> <dst port> <seq> <ack seq> <flag fin> <flag syn> <flag rst> <flag push> <flag ack> <flag urg> <flag ece> <flag cwr> <win> <urp> <payload len> <window scale>


        - batch_pcapParser6.sh
            - run pcapParser6.c in batch
            - input files: /data/ychen/sprint/pcap
            - output files: /data/ychen/sprint/text6


2. trace analyzers
    a) analyze_sprint_text.pl
        Group packets into flows, and analyze TTL, tput, pkt number, and packet length entropy.

        - input: parsed_pcap_text
            format
            <time> <time usec> <src ip> <dest ip> <proto> <ttl> <id> <length>
        
        - output
            ./output/
            a) file.<id>.tput.ts.txt: 
                total throughput timeseries
            b) file.<id>.pkt.ts.txt
                total packet number timeseries
            c) file.<id>.ids.ts.txt
                IP ID of each packet of each flow
            d) file.<id>.ttl.txt
                TTL of each flow
            e) file.<id>.ttl.ts.txt
                timeseries of # of unique TTLs of each flow
            f) file.<id>.tput.ts.txt
                timeseries of tput of each flow
            g) file.<id>.pkt.ts.txt
                timeseries of # of packets of each flow
            i) file.$file_id.len_entropy.ts.txt
                timeseries of packet len entropy of each flow

        - batch_analyze_sprint_text.sh
            - run analyze_sprint_text.pl in batch
            - input files: /data/ychen/sprint/text
            - output files: 
                a) output of analyze_sprint_text.pl
                b) log file: /export/home/ychen/sprint/output/<input file>.log

        note.
            a flow here means a unique <src IP, dst IP> pair (a minimal flow-grouping sketch follows this entry)
            XXX: should be changed to <src IP>??
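
        As noted above, a minimal sketch of the flow grouping and TTL counting (my own illustration, not the script itself; it reads the parsed-text format shown above and flags flows with more than one TTL, which mirrors the TTL heuristic used by the detection scripts):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # input: <time> <time usec> <src ip> <dest ip> <proto> <ttl> <id> <length>
        my %flow_ttls;                       # "src dst" -> { ttl -> pkt count }
        while (<STDIN>) {
            my ($time, $usec, $src, $dst, $proto, $ttl, $id, $len) = split;
            next unless defined $len;
            $flow_ttls{"$src $dst"}{$ttl}++;
        }

        foreach my $flow (sort keys %flow_ttls) {
            my @ttls = sort { $a <=> $b } keys %{ $flow_ttls{$flow} };
            printf("%s: TTLs = %s%s\n",
                   $flow, join(",", @ttls),
                   @ttls > 1 ? "   <-- >1 TTL, possible tethering" : "");
        }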


    b) analyze_sprint_text_inter_arrival_time.pl
        Group packets into flows, and analyze inter-arrival times.

        - input: parsed_pcap_text
            format
            <time> <time usec> <src ip> <dest ip> <proto> <ttl> <id> <length>

        - output
            ./output/
            a) file.<id>.inter_arrival_time.ts.txt
            timeseries of inter-arrival time of each flow


    c) analyze_sprint_tcp_connections.pl
        Analyze the # of connections (i.e. <src ip, src port, dst ip, dst port> tuples) in different time bins

        - input: parsed_pcap_text
            format:
            <time> <time usec> <src ip> <dest ip> <proto> <ttl> <id> <length> <src port> <dst port> <seq> <ack seq> <flag fin> <flag syn> <flag rst> <flag push> <flag ack> <flag urg> <flag ece> <flag cwr> <win> <urp> <payload len>

        - output
            ./output/
            a) file.<id>.connections.bin<time bin size>.txt
                timeseries of # of connections


    d) analyze_sprint_tcp_rtt.pl
        Group packets into flows, and calculate the RTTs to different destinations.


    e) analyze_sprint_tcp_udp_connections.pl
        Analyze the # of TCP/UDP connections (i.e. <src ip, src port, dst ip, dst port> tuples) in different time bins
        @timebins = (1, 5, 10, 60, 600); ## the time bin size we want to analyze


    f) analyze_sprint_http_user_agents.pl
        Search HTTP User-Agent strings for the following OS and device keywords (a minimal matching sketch follows this entry):
            OS: Windows, Microsoft, Android, MAC, Ubuntu
            device: HTC, Samsung, LGE, NOKIA, Windows Phone, iPhone, iPad, MacBookAir
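
        A minimal sketch of this keyword matching (my own illustration, not the script itself; the case-insensitive substring matching and the sample User-Agent string are assumptions):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my @os_keywords     = ('Windows', 'Microsoft', 'Android', 'MAC', 'Ubuntu');
        my @device_keywords = ('HTC', 'Samsung', 'LGE', 'NOKIA', 'Windows Phone',
                               'iPhone', 'iPad', 'MacBookAir');

        sub classify_user_agent {
            my ($ua) = @_;
            my @os  = grep { $ua =~ /\Q$_\E/i } @os_keywords;
            my @dev = grep { $ua =~ /\Q$_\E/i } @device_keywords;
            return (\@os, \@dev);
        }

        # hypothetical User-Agent string, just to exercise the matcher
        my ($os, $dev) = classify_user_agent(
            "Mozilla/5.0 (Linux; U; Android 4.0.4; SAMSUNG SGH-T999) Mobile Safari");
        print "OS keywords: @$os; device keywords: @$dev\n";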


3. tethering detectors
    a) detect_tethering.pl

        Read in results from "analyze_sprint_text.pl" and "analyze_sprint_text_inter_arrival_time.pl" to detect tethering usage.
        a) The detection is based on the number of different TTLs per second.
        b) After detecting tethered clients, calculate:
            i) how many tethered clients there are.
            ii) how much traffic is generated by tethered clients.
        c) Find possible metrics to use as the detection confidence, and do inter-flow/intra-flow analysis:
            i) tput
            ii) # pkts
            iii) pkt length entropy

        - input: file_id
            The file ID of 3-hr Sprint Mobile Dataset.
            This program uses this ID to look up the output files from "analyze_sprint_text.pl", i.e.
            ./output/
            a) file.<id>.tput.ts.txt: 
                total throughput timeseries
            b) file.<id>.pkt.ts.txt
                total packet number timeseries
            c) file.<id>.ids.ts.txt
                IP ID of each packet of each flow
            d) file.<id>.ttl.txt
                TTL of each flow
            e) file.<id>.ttl.ts.txt
                timeseries of # of unique TTLs of each flow
            f) file.<id>.tput.ts.txt
                timeseries of tput of each flow
            g) file.<id>.pkt.ts.txt
                timeseries of # of packets of each flow
            i) file.$file_id.len_entropy.ts.txt
                timeseries of packet len entropy of each flow
            j) file.$file_id.inter_arrival_time.ts.txt
                timeseries of inter-arrival time of each flow

        - output:
            a) Assuming the TTL heuristic is perfect:
                i) how many tethered clients there are.
                ii) how much traffic is generated by tethered clients.
            b) intra-flow analysis: the ratio of non-tethered traffic to tethered traffic 
                i) tput
                ii) # pkts
                iii) pkt length entropy
                iv) mean of inter-arrival time
                v) stdev of inter-arrival time
            c) inter-flow analysis: the ratio of non-tethered traffic to tethered traffic 
                i) tput
                ii) # pkts
                iii) pkt length entropy
                iv) mean of inter-arrival time
                v) stdev of inter-arrival time
            d) fig: generate #TTL/tput/#pkts/pkt_len_entropy timeseries of tethered clients detected by TTL
                ./figures_ttl/tehtered.<file_id>.<IP>.ts.txt.eps
            e) fig: generate IDs timeseries of tethered clients detected by TTL
                ./figures_ttl/tehtered.<file_id>.<IP>.ids.txt.eps

        - batch_detect_tethering.sh
            - run detect_tethering.pl in batch
            - input files: /export/home/ychen/sprint/output/
            - output files: /export/home/ychen/sprint/output/detect.TTL.<input file ID>.log


    b) detect_tethering_TTL.pl
        Read in results from "analyze_sprint_text.pl" and detect TTL tethering usage.
            a) > 1 TTL across the whole trace
            b) > 1 TTL at any second

        - input: file_id
            The file ID of 3-hr Sprint Mobile Dataset.
            This program uses this ID to look up the output files from "analyze_sprint_text.pl", i.e.
            ./output/
            a) file.<id>.ttl.txt
                TTLs of each flow
            b) file.<id>.ttl.ts.txt
                timeseries of # of unique TTLs of each flow

        - output:
            IP of tethered clients.
            a) ./tethered_clients/TTL_whole_trace.<file id>.txt
            b) ./tethered_clients/TTL_one_second.<file id>.txt

    c) detect_tethering_connections.pl
        Read in results from "analyze_sprint_tcp_connections.pl" and detect tethering using number of connections.
            e.g. > n connections at any time using time bin size b

        - input: file_id
            The file ID of 3-hr Sprint Mobile Dataset.
            This program uses this ID to look up the output files from "analyze_sprint_tcp_connections.pl", i.e.
            ./output/
            a) file.<id>.connections.bin<time bin size>.txt
                timeseries of # of connections

        - output:
            IP of tethered clients.
            ./tethered_clients/Connections_timebin<time bin size>.threshold<threshold>.<file id>.txt
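
        A minimal sketch of this connection-count check (my own illustration, not the script itself; field positions follow the parsed TCP text format above, and the time bin size and threshold are placeholder values):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $bin_size  = 60;     # time bin size in seconds (placeholder)
        my $threshold = 10;     # connection-count threshold (placeholder)

        # input fields: <time> <time usec> <src ip> <dest ip> ... <src port> <dst port> ...
        my %conns;              # src ip -> time bin -> { 4-tuple -> 1 }
        while (<STDIN>) {
            my @f = split;
            my ($time, $src, $dst, $sport, $dport) = @f[0, 2, 3, 8, 9];
            next unless defined $dport;
            my $bin = int($time / $bin_size);
            $conns{$src}{$bin}{"$src:$sport-$dst:$dport"} = 1;
        }

        foreach my $ip (sort keys %conns) {
            foreach my $bin (sort { $a <=> $b } keys %{ $conns{$ip} }) {
                my $n = scalar keys %{ $conns{$ip}{$bin} };
                print "$ip: $n connections in bin $bin, possible tethering\n"
                    if $n > $threshold;
            }
        }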


    d) detect_tethering_rtt.pl
        Read in results from "analyze_sprint_tcp_rtt.pl" and detect tethering using the variance of RTT.
            e.g. when variance of RTT to the same destination is larger than some threshold

        - input: file_id
            The file ID of 3-hr Sprint Mobile Dataset.
            This program uses this ID to look up the output files from "analyze_sprint_tcp_rtt.pl", i.e.
            ./output/
            file.<id>.rtts.txt: 
            the RTT to different destinations
            format:
            <src ip>, <dst ip>, <RTTs>

        - output:
            IP of tethered clients.
            ./tethered_clients/RTT_variance.threshold<threshold>.<file id>.txt


    e) detect_tethering_inter_arrival_time.pl
        Read in results from "analyze_sprint_text_inter_arrival_time.pl" and detect tethering usage by inter-arrival time.
        a) The mean of inter-arrival time is smaller than some threshold
        b) The stdev of inter-arrival time is larger than some threshold
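
        A minimal sketch of these two checks (my own illustration, not the script itself; the per-source-IP grouping and the threshold values are placeholders):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $mean_threshold  = 0.05;   # flag if mean inter-arrival time < this (s); placeholder
        my $stdev_threshold = 1.0;    # flag if stdev of inter-arrival time > this (s); placeholder

        # input: <time> <time usec> <src ip> ...  (parsed pcap text, as above)
        my (%last_time, %gaps);
        while (<STDIN>) {
            my ($time, $usec, $src) = split;
            next unless defined $src;
            my $t = $time + $usec / 1e6;
            push(@{ $gaps{$src} }, $t - $last_time{$src}) if exists $last_time{$src};
            $last_time{$src} = $t;
        }

        foreach my $ip (sort keys %gaps) {
            my @g = @{ $gaps{$ip} };
            my $mean = 0; $mean += $_ for @g; $mean /= scalar(@g);
            my $var  = 0; $var  += ($_ - $mean) ** 2 for @g; $var /= scalar(@g);
            my $std  = sqrt($var);
            printf("%s: mean=%.4f stdev=%.4f%s\n", $ip, $mean, $std,
                   ($mean < $mean_threshold || $std > $stdev_threshold)
                       ? "   <-- possible tethering" : "");
        }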


    f) detect_tethering_tput.pl
        Read in results from "analyze_sprint_text.pl" and detect tethering usage by throughput.
        e.g. The tput is larger than some threshold


    g) detect_tethering_pkt_len_entropy.pl
        Read in results from "analyze_sprint_text.pl" and detect tethering usage by entropy of pkt length.
        e.g. The entropy is larger than some threshold

    i) detect_tethering_TTL_default_value.pl
        Read in results from "analyze_sprint_text.pl" and detect TTL tethering usage.
        e.g. TTL != 63, 127, or 254

    j) detect_tethering_TTL_diff.pl
        Read in results from "analyze_sprint_text.pl" and detect TTL tethering usage.
        e.g. if there are multiple TTLs and their difference is 1 (or < some small number)

    k) detect_tethering_udp_connections.pl

    l) detect_tethering_tcp_udp_connections.pl

    m) - detect_tethering_user_agent.pl
       - detect_tethering_user_agent_no_win.pl
        Read in results from "analyze_sprint_http_user_agents.pl" and detect tethering using the number of OSs and devices. The only difference of "detect_tethering_user_agent_no_win.pl" is that it doesn't count "Windows" machine behind the mobile station. This is just used for boot_time based method becuase Windows disable TCP Timestamp option in default.


    n) detect_tethering_boot_time.pl
        Detect tethering by calculating the boot time.


4. cross_validate_detected_ip.pl
    Read the IPs of tethered clients detected by different methods, and cross-validate the overlap between them (a minimal overlap sketch follows at the end of this item).

    - input: 
        IP of tethered clients.
        Possible base:
        a) TTL (whole trace): ./tethered_clients/TTL_whole_trace.<file id>.txt
        b) TTL (one second) : ./tethered_clients/TTL_one_second.<file id>.txt
        c) TTL (default value) : TTL_default_value.<file id>.txt
        d) User Agent : User_agent.<file id>.txt
        e) TTL (diff) : TTL_diff.<file id>.txt

        Evaluation Methods:
        a) Connections : Connections_timebin<time bin size>.threshold<threshold>.<file id>.txt
            Time bins  = (1, 5, 10, 60, 600)
            Thresholds = (2 .. 30)
        b) RTT (variance) : RTT_variance.threshold<threshold>.<file id>.txt
            Thresholds = (0.05, 0.1, 0.15, 0.2, 0.25, 0.3, .. , 0.8)
        c) Inter-arrival time (mean) : Inter_arrival_time_mean.threshold<threshold>.<file id>.txt
            Thresholds = (0.005, 0.01, 0.02, 0.03, 0.05, .. , 4)
        d) Inter-arrival time (stdev): Inter_arrival_time_stdev.threshold<threshold>.<file id>.txt
            Thresholds = (0.005, 0.01, 0.15, 0.2, 0.25, .. , 10)
        e) Throughput : Tput_whole_trace.threshold<threshold>.<file id>.txt
            Thresholds = (10, 15, 20, 25, 30, 40, 50, 60, .. , 10000)
        f) Pkt length Entropy : Pkt_len_entropy.timebin<time bin size>.threshold<threshold>.<file id>.txt
            Time bins  = (1, 600)
            Thresholds = (0.01, 0.015, 0.02, 0.025, 0.03, .. , 2)
        g) UDP Connections : UDP_Connections_timebin<time bin size>.threshold<threshold>.<file id>.txt
            Time bins  = (1, 5, 10, 60, 600)
            Thresholds = (2 .. 30)
        h) TCP/UDP Connections : TCP_UDP_Connections_timebin<time bin size>.threshold<threshold>.<file id>.txt
            Time bins  = (1, 5, 10, 60, 600)
            Thresholds = (2 .. 30)
        i) Boot Time : boot_time.method_<methods>.<parameters>.DIFF_<time diff>.NUM_<num pkt>.<file id>.txt
            Frequency estimation methods: (1, 2, 3)
                 1 = WINDOW based
                 2 = EWMA based
                 3 = last calculated freq
            Frequency estimation parameters: 
                 1: (10, 100)
                 2: (0.5, 0.9)
                 3: (1)
            THRESHOLD_EST_RX_DIFF = (1 5 30 120)
            OUT_RANGE_NUM = (1 5 10)

    - output:
        a) How many clients are detected by 1/2/3/4/5... methods
            ./tethered_clients/summary.<file id>.number_methods.txt
            format:
            <number of methods> <number of tethered clients>
        b) overlapping between methods
            ./tethered_clients/summary.<file id>.cross_validation.txt
            format:
            <method1> <method2> <overlap> <only by former> <only by latter> <# total detected clients> <overlap ratio> <only by former ratio> <only by latter ratio>
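
    A minimal sketch of the pairwise overlap computation (my own illustration, not the script itself; it assumes one IP per line in two of the tethered_clients lists given on the command line):

        #!/usr/bin/perl
        use strict;
        use warnings;

        sub read_ips {
            my ($file) = @_;
            open(my $fh, '<', $file) or die "cannot open $file: $!";
            my %ips;
            while (<$fh>) {
                chomp;
                $ips{$_} = 1 if length;
            }
            close($fh);
            return \%ips;
        }

        my ($file1, $file2) = @ARGV;
        die "usage: $0 <IP list, method 1> <IP list, method 2>\n" unless defined $file2;

        my $m1 = read_ips($file1);
        my $m2 = read_ips($file2);

        my $n_overlap = scalar grep {  $m2->{$_} } keys %$m1;
        my $n_former  = scalar grep { !$m2->{$_} } keys %$m1;
        my $n_latter  = scalar grep { !$m1->{$_} } keys %$m2;
        my $total     = $n_overlap + $n_former + $n_latter;

        printf("overlap=%d only_former=%d only_latter=%d total=%d ratios=%.3f/%.3f/%.3f\n",
               $n_overlap, $n_former, $n_latter, $total,
               $total ? $n_overlap / $total : 0,
               $total ? $n_former  / $total : 0,
               $total ? $n_latter  / $total : 0);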

5. evaluate_methods_based_on_TTL.pl
    There are many methods to detect tethering. This program uses the TTL heuristic as ground truth, and calculates the precision and recall of the other methods with various parameters (e.g. different thresholds). A minimal sketch follows at the end of this item.

    - input: 
        IP of tethered clients.
        Possible base:
        a) TTL (whole trace): ./tethered_clients/TTL_whole_trace.<file id>.txt
        b) TTL (one second) : ./tethered_clients/TTL_one_second.<file id>.txt
        c) TTL (default value) : TTL_default_value.<file id>.txt
        d) User Agent : User_agent.<file id>.txt
        e) TTL (diff) : TTL_diff.<file id>.txt

        Evaluation Methods:
        a) Connections : Connections_timebin<time bin size>.threshold<threshold>.<file id>.txt
            Time bins  = (1, 5, 10, 60, 600)
            Thresholds = (2 .. 30)
        b) RTT (variance) : RTT_variance.threshold<threshold>.<file id>.txt
            Thresholds = (0.05, 0.1, 0.15, 0.2, 0.25, 0.3, .. , 0.8)
        c) Inter-arrival time (mean) : Inter_arrival_time_mean.threshold<threshold>.<file id>.txt
            Thresholds = (0.005, 0.01, 0.02, 0.03, 0.05, .. , 4)
        d) Inter-arrival time (stdev): Inter_arrival_time_stdev.threshold<threshold>.<file id>.txt
            Thresholds = (0.005, 0.01, 0.15, 0.2, 0.25, .. , 10)
        e) Throughput : Tput_whole_trace.threshold<threshold>.<file id>.txt
            Thresholds = (10, 15, 20, 25, 30, 40, 50, 60, .. , 10000)
        f) Pkt length Entropy : Pkt_len_entropy.timebin<time bin size>.threshold<threshold>.<file id>.txt
            Time bins  = (1, 600)
            Thresholds = (0.01, 0.015, 0.02, 0.025, 0.03, .. , 2)
        g) UDP Connections : UDP_Connections_timebin<time bin size>.threshold<threshold>.<file id>.txt
            Time bins  = (1, 5, 10, 60, 600)
            Thresholds = (2 .. 30)
        h) TCP/UDP Connections : TCP_UDP_Connections_timebin<time bin size>.threshold<threshold>.<file id>.txt
            Time bins  = (1, 5, 10, 60, 600)
            Thresholds = (2 .. 30)
        i) Boot Time : boot_time.method_<methods>.<parameters>.DIFF_<time diff>.NUM_<num pkt>.<file id>.txt
            Frequency estimation methods: (1, 2, 3)
                 1 = WINDOW based
                 2 = EWMA based
                 3 = last calculated freq
            Frequency estimation parameters: 
                 1: (10, 100)
                 2: (0.5, 0.9)
                 3: (1)
            THRESHOLD_EST_RX_DIFF = (1 5 30 120)
            OUT_RANGE_NUM = (1 5 10)

    - output:
        a) ./tethered_clients_processed_data
        format:
        <threshold> <TP> <FN> <FP> <TN> <precision> <recall>

        b) figure: plot PR curve (Precision-Recall) using plot_pr.plot.mother
        ./tethered_clients_figures/
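
    A minimal sketch of the precision/recall computation (my own illustration, not the script itself; the IP sets are hard-coded placeholders where the script would read the tethered_clients lists and the full client list):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Placeholder sets; the real script reads ./tethered_clients/*.txt instead.
        my %truth       = map { $_ => 1 } qw(10.0.0.1 10.0.0.2 10.0.0.3);   # e.g. TTL heuristic
        my %detected    = map { $_ => 1 } qw(10.0.0.2 10.0.0.3 10.0.0.9);   # method under evaluation
        my @all_clients = qw(10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.9 10.0.0.10);

        my ($tp, $fn, $fp, $tn) = (0, 0, 0, 0);
        foreach my $ip (@all_clients) {
            if    ($truth{$ip}  &&  $detected{$ip}) { $tp++; }
            elsif ($truth{$ip}  && !$detected{$ip}) { $fn++; }
            elsif (!$truth{$ip} &&  $detected{$ip}) { $fp++; }
            else                                    { $tn++; }
        }
        my $precision = ($tp + $fp) ? $tp / ($tp + $fp) : 0;
        my $recall    = ($tp + $fn) ? $tp / ($tp + $fn) : 0;
        printf("TP=%d FN=%d FP=%d TN=%d precision=%.3f recall=%.3f\n",
               $tp, $fn, $fp, $tn, $precision, $recall);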



/***************
 * Helpers:
 *  some tools to help me quickly generate readable results
***************/

1. tool1.pl
    Read in the output of "detect_tethering.pl" and summarize results from all files into a single file.

    - input: /export/home/ychen/sprint/output/detect.TTL.<input file ID>.log

    - output: /export/home/ychen/sprint/output/detect.TTL.summary
        format:
        <1. # tethered clients>, <2. # clients>, 
        <3. # tethered pkts>, <4. # pkts>, 
        <5. tethered traffic (bytes)>, <6. traffic (bytes)>, 
        <7. intra-flow tput ratio>, <8. intra-flow #pkt ratio>, <9. intra-flow pkt len entropy ratio>, 
        <10. inter-flow tput ratio>, <11. inter-flow #pkt ratio>, <12. inter-flow pkt len entropy ratio>, 
        <13. inter-flow inter arrival time mean ratio>, <14. inter-flow inter arrival time stdev ratio>

2. tool2.pl
    Summarize how many different TTLs appear behind the same source IP.

    - input: ./output/file.<id>.ttl.txt

    - output: ./output/files.ttl.summary


3. tool3.pl
    Summarize the cross-validation results.
    
    - input: 
        a) ./tethered_clients/summary.$file_id.number_methods.txt
        b) ./tethered_clients/summary.$file_id.cross_validation.txt

    - output
        a) ./tethered_clients/summary.number_methods.txt
            format:
            <# tethered clients detected by 1 method>, <ratio>, 
            <# tethered clients detected by 2 methods>, <ratio>, 
            <# tethered clients detected by 3 methods>, <ratio>, 
            ...

        b) ./tethered_clients/summary.cross_validation.txt
            <method1>, <method2>, <# overlap>, <# former>, <# latter>, <# tethered clients>, <ratio overlap>, <ratio former>, <ratio latter>,
            <method1>, <method3>, <# overlap>, <# former>, <# latter>, <# tethered clients>, <ratio overlap>, <ratio former>, <ratio latter>,
            <method2>, <method3>, <# overlap>, <# former>, <# latter>, <# tethered clients>, <ratio overlap>, <ratio former>, <ratio latter>,
            ...



/***************
 * todo:
***************/
1. implement https://www.cs.columbia.edu/~smb/papers/fnat.pdf for ID heuristic
2. implement change point detection method
3. coefficient between confidence indicators
4. Detection codes
5. figure out the packets with abnormal TTL



/***************
 * jobs logs:
***************/
- pcapParser (fix a bug about fragments) ............................ done
    batch_analyze_sprint_text.sh  ................................... done
    batch_analyze_sprint_text_inter_arrival_time.sh ................. done
        batch_detect_tethering.sh  .................................. done
        batch_detect_tethering_TTL.sh ............................... done
        batch_detect_tethering_inter_arrival_time.sh ................ done
        batch_detect_tethering_pkt_len_entropy.sh  .................. done
        batch_detect_tethering_tput.sh  ............................. done
        batch_detect_tethering_TTL_default_value.sh  ................ done

- pcapParser2 (fix a bug about fragments) ........................... done
    batch_analyze_sprint_tcp_connections.sh  ........................ done
        batch_detect_tethering_connections.sh  ...................... done
    batch_analyze_sprint_tcp_rtt.sh  ................................ done
        batch_detect_tethering_rtt.sh  .............................. done
    subtask_tcp_seq_ack/batch_analyze_sprint_tcp_seq.sh ............. done

- batch_pcapParser3.sh  ............................................. done
    batch_analyze_sprint_http_user_agents.sh  ....................... done
        batch_detect_tethering_user_agent.sh  ....................... done

- batch_pcapParser4.sh  ............................................. done
    batch_analyze_sprint_udp_connections.sh  ........................ done
        batch_detect_tethering_udp_connections.sh  .................. done
    batch_analyze_sprint_tcp_udp_connections.sh  .................... done
        batch_detect_tethering_tcp_udp_connections.sh  .............. done

- batch_pcapParser5.sh
    batch_detect_tethering_boot_time.sh

- batch_pcapParser6.sh

-       batch_evaluate_methods_based_on_TTL.sh  ..................... done
        batch_cross_validation.sh  .................................. done



/***************
 * note:
***************/
1. The tethering detection methods are applied to all IPs in the trace, including clients and servers.
    It's meaningless to check tethering for servers, and the methods may not make sense there (e.g. the TTL is collected on the client side and varies depending on the route).
    So, during the evaluation, I didn't count the IPs that are not from the cellular network (i.e. 28.XXX.XXX.XXX).


