In our team at work we have this concept which we call a “Bash off” the idea is simple. Take something that you could do very simply in bash in several steps with manual fudging in the middle to end up with a result. The sort of thing that may take you an hour to brute force your way through. Now in your programming language of choice, automate the entire process.
Sounds simple right ? I remember the first task we had which was to grab two unique columns of data and to stream them onto the screen, I think it was originally done with watch, tail and tee, took about 20 mins of playing with bash. As a team we chose our own languages and went for it, an hour later we all had a working prototype, another hour later we all had our programs really efficient.
Obviously these are really pointless, but they do have a couple of benefits that help a DevOps person stay good at what they do. It gets you to do more with your language of choice than you normally would and it also causes you to think about how you’re structuring code to make it more efficient, it also helps bond the team together with a bit of healthy competition.
To clarify just because it is called a bash off that dosen’t mean that a solution can’t be in bash. So if it was a very manual human process that you believe you can fully automate in bash, Go for it :)
This weeks challenge came from our friends in finance, they have a spreadsheet with some 14k rows in it, two of the columns have many duplicated fields in them and the have a one to many mapping between them. So the challenge is to get from 14k rows down to just the unique entries in each of these two columns and to then make sure the mapping in the spreadsheet can be looked up from a database (where the spreadsheet results must end up)
Due to some meetings I was late to the game on this task, pesky meetings; we had a bash driven prototype which we excluded because it required the excel spreadsheet to be turned into a csv and a python one that sort of got some fields from the spreadsheet and pushed them into the DB.
This is where I picked it up, sort of working; I’ve not done a lot of python so I forked the code and me and my boss went our separate ways to achieve the task, in the end it took us a couple of hours to get this working and for us to have the same results. All in all it we had achieved the processing of 14k rows of data and the manipulation into a db with the correct data, but was that enough? No We decided that it taking about 1 min was not good enough so we started focusing on making it better, I think my run time was 58 seconds and my bosses 40.
We had both chosen different ways of doing it, I had chosen to use the DB to ensure the fields were unique by checking if that field existed in the DB and if not to create it and returned the id, if it was in the DB it would return the id. My bosses approach was better, he created lists with the data in and then made the lists unique. I decided I had to get mine down to a similar speed so I started hacking it around; I decided I would store the unique entries in a list for each table and then before calling the method that puts the data in the database I would check if that unique value was already there. The first issue with this is I lost the ID number which I needed to add to the lookup table so I had to change the list to a dictionary.
I also found by moving the commit messages for the database out side of any of the methods and just dropping it at the end saved a few seconds, I was also able to remove a couple of additional DB queries all of this helped bring the time down. One of the best changes I did was on the lookup table when doing a query only pulling back one ID, rather than two; I didn’t even need it but I couldn’t see a way of querying sql alchemy with out having a field to bring back that would be quick.
As the night progressed we both made good progress, my Boss got his down to 14.1 seconds I managed to get mine down to 21 seconds so we had both made massive improvements in our codes efficiency. My boss was making use of gevent but when I tried this it slowed my program down so I left it out not understanding it anyway. I kept pluggin away and I made it down to 13.5 seconds.
I urge you to take an afternoon out in your team and to push your skills forward with a programming challenge and to see what happens, it will make you better, you will learn stuff, it is fun, it does bond the team and you will enjoy it, it is also a waste of time but what else were you going to do ?
I think you should add some code examples. The stuff for reading cells in a column using openpyxl would be really useful for people (it isn’t obvious how to do that from the docs).
Also the sqlalchemy stuff for creating and updating tables.
Good point, I created this Gist with the various different stages of code from the original at 55 (v1) seconds all the way down to 13.5 (v4)