Note: The intended audience of these posts is 100% myself, or other technical people who are looking for snippets of stored information/code for a specific task.
Tue, 05 Jun 2018 06:09:37 +0000
Socat can be used to intercept, view, and then forward arbitrary network traffic.
Run this in one terminal:
$ socat -v TCP-LISTEN:3307 TCP:localhost:3306
This tells socat to listen on port 3307 and forward traffic to localhost port 3306, where you presumably have a MySQL server running.
Now you can connect to MySQL via:
$ mysql -u root -h 127.0.0.1 -P 3307 -p
This causes the MySQL client to connect to the socat listener, which in turn forwards traffic to your MySQL instance and prints the data. In this case some of the data consists of unprintable binary bytes, which are represented as periods, but the text portions of the MySQL protocol can still be gleaned.
... \f....show tables
< 2018/06/04 23:02:24.011748 length=117 from=463 to=579
.....W....def.information_schema\vTABLE_NAMES\vTABLE_NAMES.Tables_in_database TABLE_NAME\f!................users\a......"...
> 2018/06/04 23:02:24.012169 length=11 from=291 to=301
\a....users.
< 2018/06/04 23:02:24.061067 length=61 from=580 to=640
.....def.database.users.users.name.name\f!............\a..........
> 2018/06/04 23:02:31.717434 length=24 from=302 to=325
.....select * from users
< 2018/06/04 23:02:31.767557 length=102 from=641 to=742
.....-....def.database.users.users.name.name\f!................Tom.....Jeff.....Scott\a..\a..."...
Sat, 28 Apr 2018 04:57:15 +0000
CHANGE MASTER TO
  MASTER_HOST = 'mydb.dsn.com',
  MASTER_PORT = 3306,
  MASTER_USER = 'root',
  MASTER_PASSWORD = 'password',
  MASTER_LOG_FILE = 'mysql-bin-changelog.00000x',
  MASTER_LOG_POS = 12345;
START SLAVE;
MariaDB [production]> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: production-db
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin-changelog.000021
          Read_Master_Log_Pos: 106457060
               Relay_Log_File: mysqld-relay-bin.000031
                Relay_Log_Pos: 64715845
        Relay_Master_Log_File: mysql-bin-changelog.000014
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 64715536
              Relay_Log_Space: 1045994443
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 114732
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 65224191
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Write_rows_log_event::write_row(-1)
1 row in set (0.00 sec)

ERROR: No query specified
mysql> show master status;
+----------------------------+-----------+--------------+------------------+-------------------+
| File                       | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+----------------------------+-----------+--------------+------------------+-------------------+
| mysql-bin-changelog.000083 | 118809022 |              |                  |                   |
+----------------------------+-----------+--------------+------------------+-------------------+
Sat, 28 Apr 2018 04:33:34 +0000
#!/bin/bash
# Destroy every ZFS snapshot on the system.
for snapshot in $(zfs list -H -t snapshot | cut -f 1)
do
    zfs destroy "$snapshot"
done
Mon, 23 Apr 2018 02:58:12 +0000
sudo /bin/dd if=/dev/zero of=/var/swap.1 bs=1M count=1024
sudo /sbin/mkswap /var/swap.1
sudo chmod 600 /var/swap.1
sudo /sbin/swapon /var/swap.1
Sun, 22 Apr 2018 06:27:33 +0000
As part of a project to explore migrating an AWS Aurora MySQL database to an EC2 instance, I performed several tests on EC2 SSD/EBS disk IOPS and bandwidth, which I'll share below.
Why migrate from Aurora to something else? Aurora is great, but AWS charges $0.20 per million IO requests, and for some intensive workloads the IO operations required to serve queries over a month really add up; in my experience it's not uncommon for heavily loaded database instances to run into the tens of billions of IO requests per month. 10 billion IO requests cost $2,000, and that's not including the cost of the instance itself. Of course this all depends on your application's pattern and frequency of data access, but a few billion IO operations really isn't that much IO, especially if you're running any sort of periodic ETL jobs against the database or one of its replicas.
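As a sanity check on that $2,000 figure, the IO pricing arithmetic is simple (the $0.20 per million rate is the one quoted above):

```python
def aurora_io_cost(io_requests, usd_per_million=0.20):
    """Aurora IO cost in dollars at the quoted $0.20 per million requests."""
    return io_requests / 1_000_000 * usd_per_million

print(aurora_io_cost(10_000_000_000))  # 2000.0 -- $2,000 for 10 billion IOs
```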
The tests below compare several EBS volumes against a local instance SSD on an i3.large instance.
| Disk Type | Disk Size | Read IOPS | Write IOPS | Read Throughput MB/s | Write Throughput MB/s | Cost $/GB/month |
|---|---|---|---|---|---|---|
| EBS – Cold HDD (sc1) | 500GB | 31 | 10 | 45 | 44 | 0.025 |
| EBS – Throughput HDD (st1) | 500GB | 101 | 33 | 53 | 53 | 0.045 |
| EBS – Standard SSD (gp2) | 500GB | 2074 (bursting) | 692 (bursting) | 22-53 | | |
| NVMe Instance Store SSD (i3.large) | 442GB | 43281 | 14443 | 238 | 200 | 0.2477 |
For the duration of the gp2 SSD test the disk would have been in burst mode, which means it theoretically had access to a 3,000 IOPS limit, up from its base 1,500 IOPS limit. I was also very surprised by the low read throughput of the gp2 SSD: the volume would consistently deliver 22MB/s for a period and then sometime later shoot up to 45MB/s. I can't really explain why the gp2 volume's performance is so variable.
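For reference, that 1,500 IOPS baseline comes from gp2's documented 3 IOPS per provisioned GB, with burst up to 3,000 IOPS for volumes under roughly 1TB:

```python
size_gb = 500
baseline_iops = 3 * size_gb  # gp2 baseline: 3 IOPS per provisioned GB
burst_iops = 3000            # burst ceiling for volumes under ~1TB
print(baseline_iops, burst_iops)  # 1500 3000
```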
The i3.large has a stated limit of 53.1MB/s of EBS bandwidth, which is reflected in several of the tests. When testing smaller reads and writes the effective EBS bandwidth may appear to be greater than 53.1MB/s, which I assume is due to some caching mechanism.
For testing methods see Test Disk IOPS and Read/Write Bandwidth.
Sun, 22 Apr 2018 05:31:58 +0000
Command to test disk IOPS
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.2.10
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 4096MB)
bs: 1 (f=1): [m(1)] [78.1% done] [8707KB/2753KB/0KB /s] [2176/688/0 iops] [eta 01m:23s]
fio: terminating on signal 2
test: (groupid=0, jobs=1): err= 0: pid=2864: Sun Apr 22 05:38:14 2018
  read : io=2398.4MB, bw=8297.9KB/s, iops=2074, runt=295968msec
  write: io=819616KB, bw=2769.3KB/s, iops=692, runt=295968msec
  cpu : usr=0.49%, sys=1.70%, ctx=200905, majf=0, minf=9
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued : total=r=613972/w=204904/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
   READ: io=2398.4MB, aggrb=8297KB/s, minb=8297KB/s, maxb=8297KB/s, mint=295968msec, maxt=295968msec
  WRITE: io=819616KB, aggrb=2769KB/s, minb=2769KB/s, maxb=2769KB/s, mint=295968msec, maxt=295968msec
Disk stats (read/write):
  xvde: ios=613645/214996, merge=0/57, ticks=4398044/4579196, in_queue=8977428, util=100.00%
Test Disk Read Bandwidth
hdparm -t /dev/xvdc
$ hdparm -t /dev/xvdc

/dev/xvdc:
 Timing buffered disk reads: 168 MB in 3.02 seconds = 55.60 MB/sec
Test Disk Write Bandwidth
Note: You should run this several times, or increase the count, until the result stabilizes; this eliminates the impact of any disk caches.
dd if=/dev/zero of=/mnt/throughtput_hdd/output bs=8k count=100k
$ dd if=/dev/zero of=/mnt/throughtput_hdd/output bs=8k count=100k
102400+0 records in
102400+0 records out
838860800 bytes (839 MB, 800 MiB) copied, 16.476 s, 50.9 MB/s
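dd's MB/s figure is just total bytes over elapsed seconds, in decimal megabytes; plugging in the numbers from the run above:

```python
bytes_copied = 8 * 1024 * 100 * 1024  # bs=8k * count=100k = 838860800 bytes
seconds = 16.476
mb_per_s = bytes_copied / seconds / 1_000_000
print(round(mb_per_s, 1))  # 50.9
```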
Mon, 29 Feb 2016 19:45:47 +0000
I recently needed to map top-level domains and languages to country codes, and since I couldn't find a simple Python package to do this, I built this wrapper around the GeoNames country data. I got the idea from the django-countries-plus Python package.
Sure, I probably could have done what I needed by cobbling together a few other packages, but this seemed cleaner.
Mon, 29 Feb 2016 03:59:01 +0000
Motivation: A client wanted to find a large set of phone numbers from the public web; all types of numbers, not just US phone numbers.
The common crawl provided a convenient way to efficiently process a large portion of the web and extract phone numbers. A big thanks to Ben C. at Yelp for writing the article Analyzing the Web For the Price of a Sandwich; it provided a great starting place. While Ben's goal was to search for only US phone numbers, I needed to extract phone numbers from all countries, so I opted to use the phonenumbers Python library, despite it being "slower".
I found that using the common crawl's WET files was sufficient to extract phone numbers (text content only, as opposed to full HTML). Processing the WET files meant the total dataset was roughly 8TB, down from the 151TB of the full crawl. Despite this it still took ~10 hours to fully process; the phonenumbers library is indeed quite a bit slower and more CPU intensive than a simple regex, but in this case I think it was definitely worth it, because it provided a clean, tested method for extracting all forms of phone numbers from text.
Initially I thought I'd have to detect each page's associated country in order to extract phone numbers, because if a number isn't fully qualified, e.g. '555-845-3567', its interpretation depends on the country it's being called from. This turned out not to be necessary: the client's requirement was to extract at minimum 1,000 numbers for every country/zone, and the common crawl contains more than 1,000 fully qualified numbers for every relevant zone.
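To illustrate the distinction, here is a minimal stdlib sketch that matches only fully qualified (+country-code) numbers; the regex is my own rough approximation, and the phonenumbers library's matching is far more thorough than this:

```python
import re

# Rough sketch: only numbers starting with a "+" country code count as
# fully qualified; anything else is ambiguous without knowing the
# calling country.
E164ISH = re.compile(
    r"\+\d{1,3}[\s.-]?(?:\(?\d{1,4}\)?[\s.-]?)?\d{3,4}[\s.-]?\d{3,4}"
)

text = "Call +1 650 253 0000 or +44 20 7946 0958; 555-845-3567 is ambiguous."
matches = [m.group(0) for m in E164ISH.finditer(text)]
print(matches)  # ['+1 650 253 0000', '+44 20 7946 0958']
```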
Since I was using spot instances, I split the crawl into a handful of map reduce jobs, both to avoid losing work if a job failed and to avoid reducing over the entirety of the common crawl at once.
Initially I tried a batch size of 900 WET files, which took 20 minutes. Not too bad, but I also noticed that the cluster often sat idle, due to a combination of a few straggling map operations and the overhead of starting a job with mrjob.
Upping the batch size to 3648 improved things a bit: that job took 60 minutes to complete, about 25% faster per file, mostly due to eliminating the overhead associated with scheduling jobs.
I had tried to optimize the number of mappers to fit onto the available cores in whole stages (19 machines * 32 cores per machine * 6 stages = 3648), but unfortunately a few of the mappers failed and were rerun, which added an extra final "stage" where only a few mappers were running at a time. Since each mapper processes an entire WET file, this final stage added quite a bit to the total job run time.
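The wave arithmetic from that sizing attempt, sketched out (the machine and core counts are the ones from this run):

```python
import math

machines, cores = 19, 32
slots = machines * cores   # 608 mappers can run concurrently
waves = 6
batch_size = slots * waves
print(batch_size)          # 3648

# A few failed-and-rerun mappers spill into an extra, mostly idle wave:
reruns = 5
print(math.ceil((batch_size + reruns) / slots))  # 7 waves instead of 6
```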
I think the moral of the story is that it's not easy to tune a map reduce job to fit exactly onto a cluster; there will be intermittent failures, and that's just something you have to deal with. Larger batch sizes and smaller individual map operations will reduce the overhead associated with each failure.
Actually, if mrjob allowed scheduling multiple jobs onto the same EMR cluster, it wouldn't have mattered as much: if one job was stuck waiting on a few stragglers to complete, the rest of the cluster would be busy processing the next job.