A little off topic but if anybody would know... [Archive] - ShowEQ Open Source Project Message Forums


Indifel

12-02-2002, 06:57 PM

Since the ShowEQ developers around here have probably seen more of the EQ datastream than they care to remember, I figure if anybody has the answer to this question, then you guys certainly would.

Anyway, there have been numerous reports on the EQ tech support boards of a problem regarding packet loss that goes away when you /say, /gu, /who all, etc. Tech support is pretty steadfast in their belief that it's a network issue (which it isn't) or some configuration issue (which it also isn't). Testing and the programmers, if they've even heard of the problem, probably refuse to believe it's on their end, since it's been happening for months (with a huge increase since PoP was released).

I'm just curious if anyone around here has had the chance to inspect the EQ datastream during one of these fits and gathered any information on what the cause might be (or at least what the symptoms look like). I'd do it myself, but my network setup right now (wireless down to a NAT router, shared amongst several people, and not in my apartment) makes it difficult.

Alternatively, if anyone has an idea for how I could check this out myself given my limited network options - and without running a packet sniffer on the EQ box, *grin* - I would very much appreciate it.

Ratt

12-02-2002, 08:25 PM

If you're talking about what I think you're talking about, I _believe_ the issue is with the client computer.

I've occasionally seen text get lost (having two computers side by side, one gets the /gu or /say or what have you, one does not).

Given the nature of the UDP communications, I'd say a packet is getting dropped somewhere... whether it's on the Sony side or the client side, I can't say, but the problem is basically with the protocol, in the end.

Indifel

12-02-2002, 11:51 PM

We're not talking the occasional packet missing here or there - we're talking total crippling loss of information. Can't cast spells, can't see what anyone is doing, can't see incoming text. The PL meter generally starts showing around 30% packet loss - not all in one block, but intermittent. Occasionally it clears up on its own, but usually it progresses right on up to 95% or so, just enough to stay connected for a couple minutes. The server doesn't set the character as linkdead for quite some time.

It seems to happen most often in crowded zones with both high throughput and heavy graphics load. For example, standing in bazaar, you can get the packet loss, hail or /who all to make it stop, and it's okay until you rotate, bringing other characters (and their accompanying new textures?) into view.

If any application starts a new thread on the client machine, the packet loss stops instantly. It usually comes back after a little while, though.

My guess is that there is no loss of packets along the way, but either the network device is failing to receive them, the kernel is failing to handle the network device, or EQ is somehow tossing the packets or letting a buffer overflow somewhere.
Anyway, I thought maybe somebody around here had seen this problem in action and had the chance to observe the nature of the packet loss to determine whether there really was any. I thought maybe EQ was somehow pre-empting driver routines, but there's really no way for me to test that.

rencro

12-03-2002, 12:08 AM

I've had this problem. I have a Linksys BEFSR11 router with firmware version 1.43, dated Sep 4, 2002. Once I updated to this firmware I had the same issue you describe, and a "bandaid" was to set the router's MTU from 1492 down to 1092. But it's something SOE changed recently that caused these problems to creep up: before I upgraded the firmware I would LD every 20 min, and this was around the time of the PoP patch that broke SEQ. Before the patch everything was GREAT!!

edit: BTW, problems only on win98 - me . xp was fine the whole time.....

datadog

12-03-2002, 02:05 AM

I threw both of my Linksys in the garbage.

I'm using a Siemens SpeedStream now..

only problem i have is that under situations where I have a lot of text turned on.. (i.e. bazaar, /auc, etc) i will drop an occasional character out of a text message in /gu /g or /say, etc...

Funny thing is.. i can have two computers side by side seeing exactly the same text.. one gets errors, one does not. And it's not always the same computer..

/shrug

MisterSpock

12-03-2002, 09:48 AM

The Linksys routers, as nice as they are to setup and manage, do seem to have some serious issues. Ironically, I went with one of the Siemens units, too, but had a terrible experience with it.

Anyway ==

One thing that I've noticed: If you are dropping characters in chat, OOC, etc, it seems to be related to the "bad word" filter. For me, at least, turning off the bad word filter stopped the character-dropping problem completely.

If I thought they'd lift a finger to fix it, I'd /bug it...

fryfrog

12-03-2002, 11:01 AM

i got one of the linksys routers the other day (replaced my ipcop firewall). it was the 4 port switch with built in WAP. can't remember the serial #. i updated the firmware, but was having some LD issues. i'd of course known about the linksys problem so i just searched the eq forums...

the port triggering fix is what made it work for me, but a lot of people don't understand it.

goto advanced, then "forwarding" then click "port triggering"

choose a name "EverQuest" (doesn't really matter) and put the first and second range as "1024" ~ "65535".

what this does is anytime ANY computer makes a request on ports in this range, the router sets up a temporary port map to the computer that requested the port. what is happening (i think) when you get the 95% pl is this:

udp is a stateless connection, initiated by the client (why nat works). when something happens and the server and client lose sync, the SERVER tries to initiate a connection BACK to the client... this won't work for people behind NAT firewalls. port triggering seems to fix this.

i have not had any LD issues after using port triggering. would be nice if they would narrow the RANGE down some, i've heard someone used just 1024-7000 or so though. i just now changed my mtu to 1492... but i believe that all the clients on the network need to have the same MTU or lower, no? otherwise the person with a 1092 MTU on their router is going to be creating a MASSIVE number of fragments.
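The fragmentation concern above can be checked with a little arithmetic. This is a minimal sketch, assuming plain IPv4 with a 20-byte header (no options) and an 8-byte UDP header; the datagram sizes are made-up examples, not measured EQ traffic.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
UDP_HEADER = 8   # bytes


def fragment_count(udp_payload: int, mtu: int) -> int:
    """Number of IPv4 fragments needed to carry one UDP datagram at a given MTU."""
    ip_payload = udp_payload + UDP_HEADER
    if ip_payload + IP_HEADER <= mtu:
        return 1  # fits in a single packet, no fragmentation
    # Every fragment except the last must carry a multiple of 8 bytes of data.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)


# Hypothetical datagram sizes, compared at the two MTU values from this thread.
for mtu in (1492, 1092):
    print(mtu, [fragment_count(size, mtu) for size in (500, 1200, 3000)])
```

Only datagrams larger than the lower MTU get split, so how many extra fragments a 1092 MTU creates depends on how big the packets on that link actually are.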

lane

12-03-2002, 04:45 PM

Using /hail has been a long-standing way to "get a connection back". I started using it about a year ago when someone started hailing the crap out of a mob we were killing and I laughed at them. They told me why they did it and it does seem to work quite a bit.

As for opening ports from the outside... Umm can you say security breach? You are now allowing anyone to spam your computer. I do agree with the theory, it's just that the fix seems worse since it looks like you are also opening up TCP for connections from the outside also.

-Lane

fryfrog

12-03-2002, 05:31 PM

lane, you misinterpret the "port triggering". it isn't opening a massive port range. let's say you use 4 eq computers, each one connects out on a different random port. the linksys router then creates a back-port forward for ONLY those 4 ports that were triggered.

Indifel

12-04-2002, 03:05 PM

The port triggering fix does indeed take care of one particular PL problem with that version of the Linksys firmware. After about eleven minutes or so, the router will drop the NAT entry for your UDP traffic, which causes an instant stop in any traffic between you and the EQ servers.
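To make the dropped-mapping symptom concrete, here is a toy model of a NAT table. It is only a sketch and not the Linksys firmware's actual logic: the class name `ToyNat`, the fixed 660-second lifetime (roughly the eleven minutes mentioned above), the port numbers, and the idea that a re-created mapping gets a different external port are all assumptions for illustration.

```python
import itertools


class ToyNat:
    """Toy UDP NAT table whose mappings expire a fixed time after creation."""

    def __init__(self, lifetime):
        self.lifetime = lifetime                 # seconds a mapping lives (assumed)
        self._ports = itertools.count(40000)     # external ports handed out in order
        self._map = {}                           # internal port -> (external port, created_at)

    def outbound(self, internal_port, now):
        """Client sends a packet; return the external port the router uses."""
        entry = self._map.get(internal_port)
        if entry is None or now - entry[1] >= self.lifetime:
            # No live mapping: create a fresh one on a new external port.
            entry = (next(self._ports), now)
            self._map[internal_port] = entry
        return entry[0]

    def inbound(self, external_port, now):
        """Server sends a packet; return the internal port, or None if dropped."""
        for internal, (ext, created) in self._map.items():
            if ext == external_port and now - created < self.lifetime:
                return internal
        return None


nat = ToyNat(lifetime=660)
server_view = nat.outbound(5000, now=0)      # server learns this external port
print(nat.inbound(server_view, now=600))     # mapping still alive
print(nat.inbound(server_view, now=700))     # mapping expired, packet dropped
```

In this model the server keeps sending to the port it learned at login, so once the entry is gone nothing gets through, which would look like an instant and total stop rather than the intermittent loss discussed elsewhere in this thread.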

That's different from what I'm talking about, which is intermittent and is not related to router/NAT issues at all, since a symptomatic treatment is entirely self-contained within the afflicted machine.

In any case, thanks much for trying :) Take care....

FlumberBum

12-06-2002, 03:52 PM

I have this exact same problem (As described by Indifel), and I've never been able to find a working solution. Last night I tried the /hail idea, and I'll be damned, but it works. It doesn't permanently fix the problem, but it does fix it for a little while.

For what it's worth, I also have Linksys, and I regularly play with 2 systems, and the problem has never shown up on the second system, only on the first system. I'm at a loss, but hearing that it's happening to someone else finally gives me some relief that it probably isn't something I've done wrong. That or we are both doing the same thing wrong :)

datadog

12-06-2002, 06:57 PM

I'd be interested to know what your filter settings are like.

Same on both? /filter badword is on or off?

What about /serverfilter? on or off? same on both?

The Mad Poet

12-06-2002, 08:23 PM

Linksys has problems with their newer firmware - I downgraded my router to an older firmware and this fixed the 'LD every 15-20 mins like clockwork' problem.

The weird packet loss that is fixed by /hail - I'm not sure what causes that - but I do notice that if you disconnect 100% then reconnect, it will go away.

Also if you start to experience the problem it will follow you through zones for your entire EQ session.

Personally I think it is a router problem on the Sony side - and it's just luck which router you get onto when you start the game - which is why the problem is so random.

And yes 2 computers side by side *can* get different routers on Verant's side when you play.

Because it's not across the computers on a local lan - and it's random not every time - this all points to a problem at Sony.

datadog

12-06-2002, 08:26 PM

And yes 2 computers side by side *can* get different routers on Verant's side when you play.

huh ?

Nurseling

12-06-2002, 11:29 PM

but only with win 98. after the upgrade, both computers on my network run great

The Mad Poet

12-07-2002, 12:02 PM

huh ?

When you start a connection to EQ you are routed through the internet - the client starts a connection and connects to the server.

Verant has load balanced routers coming in through multiple connections to the internet.

For instance..

Sprint backbone ->

T1 -> verant
T1 -> verant
T1 -> verant

Each of those T1s has a router on it hooked into Verant's network.

When you start the EQ client on one machine - you get the first T1 and the first router

When you start EQ on the second machine the routing logic that works between Sprint and Verant will decide the first T1 is overburdened and put your second computer on the second T1 - even though it is coming through the same IP - it is a different UDP port and a different connection.

Once you start the client you are stuck on a router until you restart the client.

SO - if the second T1 has some connection problem or is dropping packets - it is possible for your first computer to be fine - and your second computer to go LD or have issues with lag.

Expand this to include many more T1's to other Internet backbones and then when a bank of routers or a backbone connection goes down - that is why sometimes you will see 10 or more people in a raid with you go LD all at the same time.

Sometimes on the more severe problems you will see like 1/2 or more of your server go down - when you could do a /who all COUNT a few times I saw the server go from 2400 people online to 300 ... this is a major backbone dropping connection or Verant having a major routing issue.

To restate:

Some routers at Verant have problems - that is why sometimes you get weird lag/packet loss and rebooting will fix the issue. It is also the reason for the packet loss /say fix.

datadog

12-07-2002, 02:38 PM

To restate:

Some routers at Verant have problems - that is why sometimes you get weird lag/packet loss and rebooting will fix the issue. It is also the reason for the packet loss /say fix.

Do you KNOW this to be true? or are you speculating based on the anecdotal evidence you have seen?

Do you know what type of routers they are using? Who their ISP is? What about the firewalls?

How does using /say have any effect on which "router" you are using ?

BlueAdept

12-07-2002, 06:49 PM

I get that bad lag occasionally. I'll get a PL of up to like 90%. Do a /hail or /say and boom...it's fixed for a little bit.

I have not changed anything on my system in over 6 months. It has to be something with them.

Mr. Suspicious

12-07-2002, 07:55 PM

It has to be something with them.

There's no such thing as "me" and "them" when it comes to internet. Think of the 1,000 devices your signal is going through after it leaves "your end" and before it reaches "their end" and vice versa.

One station is YOUR ISP (could be considered "your end" by "them"). Are you sure your ISP doesn't have a hiccuping router somewhere? Are you sure your ISP's backbone provider (could also be considered "your end" by "them") doesn't have a hiccuping router somewhere? Are you sure none of the TransOceanic Communication Providers (the "limbo") have a hiccuping router somewhere?

baelang

12-07-2002, 08:00 PM

If it isn't you, then it is them...OR the other guys... OR perhaps someone ELSE.

BlueAdept

12-07-2002, 09:07 PM

Originally posted by Mr. Suspicious

One station is YOUR ISP (could be considered "your end" by "them"). Are you sure your ISP doesn't have a hiccuping router somewhere? Are you sure your ISP's backbone provider (could also be considered "your end" by "them") doesn't have a hiccuping router somewhere? Are you sure none of the TransOceanic Communication Providers (the "limbo") have a hiccuping router somewhere?

Since I am not the only one having this problem, and since I have talked to people in my guild from the US east coast, the US west coast, England and Australia who have all had similar problems, I find it hard to believe that all of "us" have problems with our ISPs and/or the ISPs' providers and/or TransOceanic Communication Providers, all at similar times and with similar solutions (i.e. /hail or /say).

Think of the 1,000 devices your signal is going through after it leaves "your end" and before it reaches "their end" and vice versa

Since there is evidence that other people are experiencing the same problem from different locations around the world, it seems logical to conclude that the origin of the problem is closer to the point of convergence than to each individual's location.

Without getting traceroutes from some of the people that are having the problem, when they are having the problem, it would be impossible to absolutely exclude the possibility that it is on "my end".

I am not trying to start an argument, but I do have a grasp on how the Inet works. I am a network engineer for a large resort. Resume and references are available upon request. :)

devnul

12-09-2002, 05:08 PM

This problem started back in April or so, and it affects lots of people that never had problems before. It's definitely something they did.

Some people have reported being hosed with packets from SOE during these times.

I suspect there are two related issues. One is Linksys routers of a certain firmware losing the mapping.

The other is something they started doing about the time they stopped sending packets for non moving toons. Some load balancing or something.

Both seem the same, one is because the router forgot that a port should be open. The other is because SOE changed the port on the fly. Your client knows the new port but has not used it outbound. (so incoming traffic is blocked) It may even be requesting data repeatedly on one port and looking on another port that never comes.

So when you /h or say something then packets go out and if they go out the new port the router knows to let packets in on the new port and all is well. If the port change is not the one your hail goes out on then it doesn't fix it. And it doesn't always fix it.

That's my theory anyway.

Or would be, except that tech support says there's no problem. We all have ISPs with identical glitches that just coincidentally kicked in after one EQ patch.

dn

Indifel

12-13-2002, 02:00 PM

Okay, to dispel any notion that it's a Verant server issue....

I run a little web proxy (AnalogX Proxy) on one of my desktop machines so that I can proxy websurf via my laptop through it.

If I get the hail-/who all fixable packet loss, something else that fixes it is if I load a web page through that proxy.

It works every time retrieving a variety of remote pages (Allakhazam and the like, most often), although I haven't checked it out by loading a page served from my LAN through the proxy.

rencro

12-13-2002, 03:17 PM

I thought I answered this, turn down your MTU to 1092 problem SOLVED!! Until VI fixes the problem on their end. Let me repeat.... SOLVED!!

Indifel

12-16-2002, 03:31 PM

Considering that many people are posting to the tech support forum that they aren't using a router on their LAN at all, that solution won't work.

rencro

12-16-2002, 06:26 PM

huh? How about setting it individually.. No too easy, keep at it though.

BlueAdept

12-16-2002, 11:49 PM

Aye...I have no router.

KaL

12-17-2002, 08:49 AM

You don't need a router to set your MTU.

http://www.starpower.net/support/pc/faq/system/win98/how.to.manually.adjust.mtu.html

Mismatched MTU settings on routers is a very common problem. I worked for a national ISP as a network engineer for several years. MTU settings came back to bite us on the ass a few times ;)

Indifel

12-17-2002, 09:45 AM

There is also an interesting thread

http://boards.station.sony.com/ubb/everquest/Forum3/HTML/034574.html

where somebody else discovered that his machine was receiving an inordinate amount of UDP traffic which appeared to be sourced from the EQ servers.

Hey, what do you know, this brings me back to my original question, which was " I'm just curious if anyone around here has had the chance to inspect the EQ datastream during one of these fits and gathered any information on what the cause might be (or at least what the symptoms look like)."

Indifel

12-19-2002, 03:05 AM

Well, I bit the bullet and reconfigured my machines to use one as a NAT router (so now I'm double-NATed, hehe) for the other, so I could use tcpdump to examine the packet stream.

I feel so dirty ;)

But anyway, I have discovered that what the guy posting to the official boards said is true - people experiencing this problem are indeed getting spammed by the EQ zone server. I dumped some packet captures to a file and took a peek, and using some of the source code from EQemu and ShowEQ as a guide (for packet format and packet opcodes), discovered the following:

During the packet loss, the EQ zone server is flooding the client with an abundance of large (anywhere from 150 to 500+ bytes, not counting UDP/IP headers) packets. They appear to be legitimate EQ content packets, near as I can tell. In addition, every one of these packets carries the same EQ opcode: 0x6b42, which the EQemu and ShowEQ sources list as a "new spawn" packet.

Even more interesting... these packets are the same ones, over and over and over.... The spacing between a packet and each of its duplicates seems to be about 230ms. The packet capture I was examining only repeated four packets (and still accounted for most of the packet loss), but if you could imagine that happening with 10 or 100... well, that's a lot of bandwidth ;)
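The kind of check described above can be sketched roughly as follows. It is not the analysis actually run on the capture: it assumes the capture has already been reduced to (timestamp in milliseconds, raw payload bytes) pairs, and the function name and sample data are made up for illustration.

```python
from collections import defaultdict


def repeated_payloads(capture):
    """Find payloads seen more than once and the gaps between sightings.

    capture: iterable of (timestamp_ms, payload_bytes) pairs.
    Returns {payload: [gap_ms between consecutive sightings]}.
    """
    sightings = defaultdict(list)
    for timestamp_ms, payload in capture:
        sightings[payload].append(timestamp_ms)

    gaps = {}
    for payload, times in sightings.items():
        if len(times) > 1:
            times.sort()
            gaps[payload] = [later - earlier for earlier, later in zip(times, times[1:])]
    return gaps


# Made-up sample: payloads A and B repeat, C is seen only once.
sample = [(0, b"A"), (50, b"B"), (230, b"A"), (280, b"B"), (300, b"C"), (460, b"A")]
for payload, spacing in repeated_payloads(sample).items():
    print(payload, spacing)
```

Comparing whole payloads byte for byte avoids having to know anything about the packet format; a steady gap for every repeated payload is what a fixed resend timer would produce.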

(As an aside, am I correct to assume that these are among the packets which are encrypted?)

Anyway, there are also flags set in these packets which indicate "acknowledge requests"; honestly I'm not sure how these are used, and there don't appear to be subsequent packets from the client responding to the request. (That could just be me not understanding how it works.) *But*, shortly after submitting a "/who all" request to the server, the client *does* send a packet with its ack response set, but in response to a different packet (not one of the repeated new spawn packets). Then, strangely, the packet spam stops, and everything starts behaving normally.

It should be noted that the "/who all" packet sent by the client also has *its* ack request bit set.

In addition, each packet includes a sequence number. The number continues to increment for each packet, including the oversized ones, which leads me to believe that EQ is dropping these packets based on bad/unexpected content rather than not receiving them at all.

Finally, I would really like to find out whether (A) there is a client request that causes this packet spamming, or whether (B) it is a failure of the client to respond to an ack request (normally, it appears to respond to an ack request within 50ms or so, using the number of the last ack request it receives), or whether (C) it is just misbehavior on the part of the server. (My guess is "B", or perhaps a combination of "B" and "C".)
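Hypothesis "B" can be pictured with a tiny simulation of a sender that keeps resending unacknowledged packets on a timer. This is a sketch of generic resend-until-acked behavior, not the real EQ protocol: the function name, the packet ids, and the use of 230 ms as the resend interval (taken from the spacing seen in the capture) are assumptions.

```python
def transmissions(pending_ids, ack_time_ms, resend_interval_ms=230, duration_ms=2000):
    """Return the (time_ms, packet_id) sends made by a resend-until-acked sender.

    pending_ids: ids of packets waiting for an ack at time 0 (hypothetical ids)
    ack_time_ms: when the client's ack arrives, or None if it never does
    """
    sent = []
    t = 0
    while t < duration_ms:
        if ack_time_ms is not None and t >= ack_time_ms:
            break  # ack received: nothing left to resend
        for packet_id in pending_ids:
            sent.append((t, packet_id))
        t += resend_interval_ms
    return sent


four_packets = ["spawn-1", "spawn-2", "spawn-3", "spawn-4"]
print(len(transmissions(four_packets, ack_time_ms=50)))    # client acks promptly
print(len(transmissions(four_packets, ack_time_ms=None)))  # client never acks
```

With a prompt ack each packet goes out once; with no ack the same four packets are sent again every interval, which matches the shape of the flood in the capture without showing which side is at fault.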

If anybody knowledgeable about this stuff has any further insight, I'd much appreciate it. And if anybody knows who I can contact at SOE about getting this resolved, without getting my account banned (hey, you never know) for peeking at packets, I'd appreciate that too :)

(Edited to clarify the 230ms spacing)

high_jeeves

12-19-2002, 09:42 AM

I would think B... the only reason the server would have to resend an "important" packet is if it never received an ACK from the client. I would guess that the problem is actually either in the client, or in the network route from the client to the server.

--Jeeves

Indifel

12-24-2002, 06:54 PM

Nobody else interested in this problem, or the rather unusual behavior on the part of the server and client? Or does this place not actually have any "hacking" spirit anymore?

high_jeeves

12-24-2002, 07:32 PM

I have never, in my almost 4 years of EQ, with a number of different networks and machines, had this problem.. so it's not real easy for me to take a look at it...

--Jeeves

Indifel

12-25-2002, 03:48 PM

I understand that not everyone has had this problem - but that could be fundamental to the cause as well. I already know what the symptoms look like now, and was hoping for speculative opinion more than anything else....

In any case, high_jeeves, thanks for your insight a few posts up. I'm kind of leaning toward B as well :)

KaL

12-26-2002, 10:22 AM

Just to add my experiences to the pile.

I used to have this issue in Michigan, using Road Runner cable. I had 3 computers hooked up to a 10-mbit hub going to a Linksys router (I don't recall the specs, but it only had one port besides the broadband port). 2 active EQ accounts at one time.

I get the same problems but only very infrequently with the same 10-mbit hub in North Carolina using Time Warner cable (also Road Runner) with a D-Link router (DI-704). 3 active EQ accounts at one time.

But, using that 10-mbit hub on a Linksys router (4 port model) in Virginia through Comcast, I didn't get this problem at all. 1 active EQ account.

I'm thinking it has something to do with more than one EQ player on a subnet, using IP masquerading.

KaL

Edit: Forgot to add that doing a /say or /g or something almost always clears it up.