Hardware Canucks

Dead Things September 13, 2011 09:13 AM

Help! Need some programming expertise to convert txt to csv
 
Not sure if this is the right place to post this, so mods feel free to move it if there is a more appropriate home for this thread...

I will start off by saying that I have zero programming know-how. As Neytiri would say, I'm like a baby, making noise, don't know what to do.

Anyway, I have several large text files containing twitter data. Here is a small excerpt of the data to illustrate the format it is in:

Code:

epaulnet: Nanaimo Daily News: #Qaddafi backed efforts against al-#Qaeda: DND reports‎ http://t.co/iTVFkqh #Libya
04/23/2011 retweet

omar_chaaban: http://bit.ly/gBFpNo Protesters arrested at Libyan Embassy #Libya #Canada
04/05/2011 retweet

kyleharrietha: The #ICC has evidence Gadhafi's gov't planned to put down protests by killing civilians - http://bit.ly/gIQCZk #Libya
04/05/2011 retweet

epaulnet: The Canadian Press: 16 #journalists, including one Canadian, are missing or detained in #Libya http://t.co/KtVIlTg #Canada #CPJ #NATO #War
04/26/2011 retweet

jdalrymple: Apparently the Canadian air force is now doing flyovers in Libya. Canada has a plane?
04/07/2011 5 retweet

nedhamson: Canada mulls call for more jets for Libya campaign http://ff.im/-BaJjf
04/14/2011 retweet

vancouversun: Canada must wait to hear costs of Libya war http://bit.ly/hydYMu
04/07/2011 3 retweet

gpollowitz: Trump in a nutshell: America sucks now, Bush is the worst president ever, Obama is the worst president ever, invade Libya, Canada's HC 4 all
04/18/2011 retweet

vancouversun: Canada mulls call for more jets for Libya campaign http://bit.ly/hUsJGu
04/14/2011 retweet

weddady: Kaddafi's pr strategy..to give speeches at 3AM.. was he trying to compete for ratings... in the US & Canada? #Libya
04/29/2011 10 retweet

newsmanly: Canada should not be in Libya - Windsor Star http://dlvr.it/P5d70
04/20/2011 retweet

I'd like to convert this data into csv format so that I can import it into SPSS (or Excel if need be) - but I have no idea how to do that. In total, I have around 70,000-ish records, hence the need for an automated process. In the end, the SPSS data fields I would like to end up with are:

Author - defined as everything preceding the first colon on the first line of each record.
Content - defined as everything following the first colon on the first line of each record.
Date - defined as the XX/XX/XXXX content on the second line of each record.
Retweets - defined as the number following date but before the word "retweet" (NB: when 0, this number is absent from the record).
Links - defined as any and all urls appearing in the "Content" field
Link# - the number of urls appearing in the "Content" field

Can it be done? Can anybody help point me in the right direction? Any guidance would be much appreciated! If I can offer anything in exchange for your help, please let me know. Thanks!

edit - Source txt data is encoded in UTF-8, by the way.
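
The field definitions above could be sketched in Python roughly as follows. This is a minimal sketch, not something tested against the full 70,000-record data set. It assumes every record is exactly two lines (the tweet line, then the date line) with blank lines between records, and that `https?://\S+` is a good enough pattern for the URLs. The names `parse_records`, `URL_RE` and `META_RE` are made up for illustration.

```python
import re

# Assumed patterns; the names are hypothetical, not from the thread.
URL_RE = re.compile(r"https?://\S+")
META_RE = re.compile(r"^(\d{2}/\d{2}/\d{4})(?:\s+(\d+))?\s+retweet\s*$")


def parse_records(text):
    """Yield one dict per record from the raw text."""
    # Records are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) != 2:
            continue  # skip anything that does not look like a record
        first, second = lines
        # Split on the first colon only; URLs in the content contain colons too.
        author, sep, content = first.partition(":")
        meta = META_RE.match(second)
        if not sep or meta is None:
            continue
        links = URL_RE.findall(content)
        yield {
            "Author": author.strip(),
            "Content": content.strip(),
            "Date": meta.group(1),
            # The count is absent from the record when it is zero.
            "Retweets": int(meta.group(2) or 0),
            "Links": " ".join(links),
            "Link#": len(links),
        }


sample = """jdalrymple: Apparently the Canadian air force is now doing flyovers in Libya. Canada has a plane?
04/07/2011 5 retweet

nedhamson: Canada mulls call for more jets for Libya campaign http://ff.im/-BaJjf
04/14/2011 retweet
"""

records = list(parse_records(sample))
for rec in records:
    print(rec)
```

For a real file, the text would come from something like `open("tweets.txt", encoding="utf-8").read()`, where `tweets.txt` is a placeholder name and the encoding matches the UTF-8 note above. Records that do not fit the two-line shape are silently skipped here, so counting them would be a sensible first check on real data.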

Arinoth September 13, 2011 02:12 PM

It possibly could be done; hell, I could probably figure something out for this if I had the spare time.

I'll try to look into it for you, however I have a lot on the go right now so I may not be able to guarantee you anything.

ilya September 13, 2011 02:34 PM

Have you tried simply adding comma separators via a script and then opening the file in Excel? It might be a lot easier than you think since the format is so consistent.

Dead Things September 13, 2011 03:26 PM

Quote:

Originally Posted by ilya (Post 549006)
Have you tried simply adding comma separators via a script and then opening the file in Excel? It might be a lot easier than you think since the format is so consistent.

With my skillset, the only way it could be easy for me is if each field was a fixed width so that I could sub in delimiters at specific intervals. But since both author and content are variable-width fields and since the retweets field occurs only occasionally, I would have to use some sort of programming logic to tell it where to look for the appropriate places to insert delimiters. And that is well beyond my capabilities, I'm afraid. I tried reading "Sed - An Introductory Tutorial" last night and was immediately overwhelmed by the urge to punch babies.
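
One way around the variable-width problem is to not insert delimiters by hand at all and to let a CSV library do the quoting, since the tweet text can itself contain commas. Below is a minimal sketch using Python's standard `csv` module. The two rows are copied from the sample data in the first post, with a retweet count of 0 filled in for the record that has none; the four-column layout is only an assumption about what SPSS or Excel would want.

```python
import csv
import io

# Two hand-built rows: Author, Content, Date, Retweets.
rows = [
    ("gpollowitz",
     "Trump in a nutshell: America sucks now, Bush is the worst president ever, "
     "Obama is the worst president ever, invade Libya, Canada's HC 4 all",
     "04/18/2011", 0),
    ("vancouversun",
     "Canada must wait to hear costs of Libya war http://bit.ly/hydYMu",
     "04/07/2011", 3),
]

# Write to an in-memory buffer here; a real script would instead use
# open("tweets.csv", "w", newline="", encoding="utf-8") (placeholder file name).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Author", "Content", "Date", "Retweets"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading it back shows that the commas inside the tweet did not split the field.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

The writer wraps any field containing a comma in double quotes, so the content column comes back as a single field regardless of its width.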

Keywork September 13, 2011 05:01 PM

This can be done in Perl rather quickly and efficiently. Can you post up an example of the result you want? (like actual data so if anyone helps out here, they know it's actually correct). I need to finish up some things for school but I can probably mock up a quick script today or tomorrow if you're interested!

Dead Things September 14, 2011 03:58 PM

1 Attachment(s)
Very cool Keywork! And yes, very interested! Thanks! I've attached an example of the output I'd like using the same data samples as posted above for reference.

grinder September 14, 2011 04:24 PM

MS Access (Microsoft Office Pro) would handle this for ya too.

Arinoth September 14, 2011 06:18 PM

Keywork, lemme know if you're doing it or not, otherwise I'll throw my hat into the ring and try to work out a little program in my native C++ language

Keywork September 14, 2011 07:04 PM

Bleh C++. I'm finishing up an article right now and then I'll spend some time on it! Shouldn't take long! I'll post back here in a few.

Arinoth September 14, 2011 07:06 PM

Quote:

Originally Posted by Keywork (Post 549328)
Bleh C++. I'm finishing up an article right now and then I'll spend some time on it! Shouldn't take long! I'll post back here in a few.

C, C++ and C#: a real man's languages. Well, I'm never going to attempt to program in assembly ever again.


All times are GMT -7.